INTRODUCTION

The papers in this monograph address an issue of importance to
educational policy and practice: the use of testing and evaluation to
assess the quality of education and to facilitate school improvement.
Authors consider the traditional role that testing has played in
accountability and the role that assessment and evaluation can and should
play in improving teaching and learning; they point out some of the
problems and limits of current evaluation practices and call for new
approaches that will broaden perspectives on schooling and contribute to
the usefulness of the evaluation enterprise.

The papers are drawn from "Wagging the Dog, Carting the Horse: Testing
vs. Improving California's Schools," a conference sponsored jointly by the
UCLA Center for the Study of Evaluation and the UCLA Laboratory in School
and Community Education, both units within the Graduate School of
Education. The conference was held on June 7, 1984 and attracted a diverse
audience of over 100 educational practitioners, policy-makers and
researchers. The conference contributed substantially to promoting
dialogue and communication among these various groups, interactions which
help to bridge the gap between research and practice.
EVALUATING EDUCATIONAL QUALITY: A RATIONAL DESIGN

Eva L. Baker
UCLA Center for the Study of Evaluation
The world is too capricious for us to accept it as it is. So
for psychological as well as practical reasons we have come to
believe that we can influence the course of events. Large numbers of
people during great epochs of history did not so believe. Whatever
occurred was accepted as predestined either because of an unknowable
master plan, or as a consequence of behavior in former incarnations.

Times have changed. Reasoning and thought have come to have
specific uses in the information-driven society of the present. We
want to be rational so we can believe that we understand and control
events. We plan. We implement. We assess. Then we try to learn
from experience and plan better next time. The evaluation process,
in schools and elsewhere, is based upon this view of the world. A
corollary to this perspective is our focus on goals and standards.
If we have a clear idea of what we want, communicate it well to all
actors, and have a criterion for judgment, then we should not only
see change, but the change should be in the direction intended.
Obvious stuff for school people who have had a surfeit of experience
with goals, objectives, and standards. It should be easy; it should
work. It isn't, and it doesn't very often. The purpose of this
paper is to describe what goes wrong with the good idea of evaluation
for school improvement and to suggest some possible remedies.

The Problem 

Schools have had years of experience with evaluation. Sometimes
the function has been called testing, grading, standards or
assessments and has been applied to student performance. Later, these
activities were directed at programs as well as at people. So school
people are not newcomers to evaluation, although they may neither
know nor particularly care about the newest name applied. They
have experienced waves of equity, quality, and improvement, in
criterion-referenced, norm-referenced, goal-oriented, decision-theoretic,
responsive, goal-free, illuminative, discrepant, creative, and
transactional methods, and are now caught up in a wave of excellence.
In fact, academics have perpetuated untold numbers of evaluation
models and measurement approaches, an activity appropriate for our
personal incentive structure. Unfortunately, the one model we want
remains elusive: effective. Why doesn't evaluation work the way we
think it should? One might say that our expectations are too high or
that the technology is weak, without much impact. The specter of
bumbling educators, clinging precariously to the lowest range of the
SAT (as national magazines report our performance), suggests another
explanation: maybe we haven't been smart enough to figure it out.
Makers of policy have acted on that belief and attempted to place the
evaluation of learning in the hands of presumed betters: technicians
who use either cost-benefit formulae or psychometrically elegant
models, and sometimes both. Incidentally, most of them have struck
out as well.

Backing up and taking a simple view (how appropriate!) we might
redefine the problem. On the one hand, everyone knows, and even
social science research supports, the idea that information can be
used to make improvements in programs of various sorts. A number of
conditions must be met, however. First, the information has to be in
a usable form, so that long trains of hypotheses and inferences may
be avoided. Second, the information should be available to those
individuals who are responsible for implementing changes. Obviously,
the information should be timed so that changes can be made as
needed.

Third, information should be valid. It should provide an
accurate picture of the matters of interest. (Validity does not
necessarily imply precision, a topic to which we will return later.) In
addition, those charged with using the information must find it
credible. Credibility gets built in many ways, through logic, or
through association with authority mechanisms or persons (like
experts). A strong way to build credibility is to allow the end
users to design, create or amend the character of the information
base, so that, in the metaphor of our economic system, they buy in,
feel ownership, or invest in the entire process. Although these
points provide only a quick picture of requirements for information
utilization, those of us involved in educational evaluation can
identify immediately their implications: evaluation should be
designed, implemented, once more, and used at the principal unit of
change: the school. Without itemizing reasons why schools are the
appropriate unit of change, let us accept that much good research and
analysis have led to this perception.

This analysis now aside, all of us know that evaluation
activities as they operate in most school districts are driven by a
different reality. Evaluation is a process mandated from above and
often from outside of the operational management of the educational
enterprise. School boards, legislatures, and state education leaders
have legitimate questions about the effectiveness of education.
These questions involve management, staffing, quality of services, as
well as those concerned with the more traditional outputs of
education, such as whether students learn, what they learn, and the larger
question of how well they are prepared to function in the world.

A response to these two legitimate viewpoints implies at once 
that 1) evaluation should generate information useful at the point of 
change; and 2) evaluation should contribute to responsible oversight 
of the educational system. Thus, we find in these two views that 
premises, assumptions, present practices, and implications of 
evaluation seemingly conflict irreconcilably. 

Point of change evaluation emphasizes the specialness of each
site, that is, the unique character of each school, comprised of the
particular staff, setting, students, and social context. Point of
change evaluation implies recognition and attention to the particular
personality of a given school. The evaluation effort needs to be
sensitive to the teachers involved, their experience, content and
pedagogical expertise, views of their role, their stance toward their
students and toward management. Clearly, point of change evaluation
should have good information about students, information which
extends beyond gross estimates of performance on commercial achievement
tests or socio-economic status assignments.

Among the strongest demands is that the evaluation information
be directed to matters of importance and to those susceptible to
change. The particular content, goals, and learning problems facing
the school should be reflected in the data collection strategies and
in how progress is judged. To expand for a moment on this particular
point, one would expect that the way any important educational goal
is treated would be influenced greatly by its practicality in a
specific environment. For instance, many schools have identified
comprehension in reading as a principal goal on which to focus effort. Yet
what aspects of comprehension are appropriate for a given school
population, or even groups of children within the school, differ
dramatically. Comprehension for children at one school may mean
basic parsing of meaning to understand the literal content of a
sentence, whereas comprehension for other children might involve
relatively sophisticated inferencing. Both sets of staffs at the
school sites may be working to capacity to improve reading
comprehension. However, an evaluation or testing procedure that looked at
absolute levels of performance would credit one school greatly over
the other. Point of change evaluation would need to provide
information peculiar to each site so that the appropriate
instructional consequences could be identified and applied. We will return
to some of the methodological and research issues inherent in this
point of view later.

On the other hand, a system useful for accountability and
oversight demands almost a wholly different set of features. First,
the database must have comparability so that contrasts among schools
can be made. Second, the academic content areas of interest must be
those that either have high priority for the public or those for
which policy decisions are required. Such requirements implicitly
restrict the number of measures (or indicators or constructs)
employed because political priorities and policy options are
definitionally constrained. A third feature of top-down assessment is the
more self-conscious emphasis on the connections among organizational
units and subsystems, e.g., budget, staffing, management,
instruction. For a graphic summary of top-down (accountability) and
bottom-up (point of change) evaluation features, consult Figure 1.
This chart is presented to identify salient contrasts and overlaps
between top-down and bottom-up evaluation perspectives. A brief
review of Figure 1 illustrates that the demands of top-down and
bottom-up evaluation overlap but also differ enormously. Such
feature differences are also represented in reality by the deployment
of multiple data collection or evaluation projects.
CONTRASTS BETWEEN TOP-DOWN & BOTTOM-UP EVALUATION FEATURES

Figure 1 
These evaluation efforts occur in essentially disjunct ways.
For example, a typical school district in California might have
evaluation activities relative to a range of separate requirements:
1) Superordinate demands; 2) District requirements (regular); 3)
District requirements (special); 4) School imposed; 5) Classroom
driven.

Figure 2 
Types of Evaluation Demands 

1. Superordinate demands

2. District regular

3. District special

4. School imposed 

5. Classroom driven 

At present, there are pitifully few instances where any
integration at all occurs among these different purposes. To expand,
1) superordinate demands are triggered exogenously and include
requirements for state assessment programs, participation in National
Assessment, research projects, advanced placement and other scholastic
tests. 2) Regular district requirements may include administration
of one or more achievement test batteries and implementation of tests for
student certification, either high school exit examinations,
grade-to-grade promotion examinations, or placement tests for
identification purposes, e.g., special education, language deficits.
3) Special district evaluation efforts encompass those required for
reporting to State and Federal agencies for special funding, any
program-specific assessment related to curriculum comparisons, the
institution of new programs, and so on. 4) School imposed requirements
may be those identified by the school as a particular planning goal,
for instance, to improve written composition across the curriculum
areas. 5) Classroom driven evaluation may include those
commonplaces required for a teacher to perform according to expectations,
e.g., moving students around, assigning grades, having conferences,
as well as those pertinent to meta-instructional demands, e.g.,
checking to see if a new set of workbooks was worth using
again, self-assessing the quality of teaching, or trying to figure
out a new way to deal with a common arithmetic problem the students
have. Uses of information at a classroom level must necessarily be
specifically relevant to the options perceived as available by the
teacher to move on, and within his/her capacity to achieve. Timing
may either be on an instantaneous fuse, "I need to reassign these
students Thursday," or, for meta-instructional analysis, may be
deferred until the next time the unit is taught or shared with a
colleague whose schedule is two or three weeks slower. Notice that
for these five different types of information-driven applications we
have focused almost entirely on student performance as the principal
data source. It should be clear, however, that teachers' use of
information from relatively formal tests, even those which they
design themselves, is always augmented, elaborated, interpreted and
modified by the wider sense they have about what students can
actually do. In the CSE study of test use (Herman, 1983), a ringing
finding was that teachers don't pay much attention to traditional
outcome measures as main information sources for instructional
decision making. Why not? Teachers don't do so for a number of
reasons. First, they ought to (but actually may not) be skeptical
about the tests' validity, that is, how close the tests come to
measuring what teachers think they are teaching children. Secondly,
the well known problem of timing is critical. Third, teachers have
informal ways of assessing student performance comprehensively,
judging in-class behavior, homework, task-orientation, or student
efforts on work other than standardized tests, and can draw upon the
accumulated pattern of information that they develop about a student
and, ideally, take into account students' rhythms in progress rather
than rely on a one-time, cross-section sample of performance.

Nonetheless, the actual practical database that teachers use can
be regarded as either suspect (by pessimists) or open to improvement
(by the rest of us). The matter simply rests upon how credible the
teacher is as the single source through which a wide set of data,
implicit criteria, and totally unreviewed decisions get filtered.
Would it were that we felt somewhat more certain about all teachers'
competence to do this complicated job. But our teacher training
programs have not taught them how, nor are they given many models or
incentives to take this part of their task seriously, and it is,
after all, difficult. Thus, the proposal for a top-down, bottom-up
system is designed to be an aid to teachers as well as a more formal
mechanism to assess and subsequently to improve educational quality.



What I have tried to outline is a complicated system that has
nominally complicated information demands. The reality is such,
however, that in most instances decisions are made in the absence of
formal information and that the information getting, displaying, and
bemoaning are more ceremonial acts than instrumental tactics. But
let us return to the ideology of rationality discussed at the outset
of my remarks. Information should help one get better at "x"
enterprise. Schooling generates natural categories of information
that ought to feed into a decision process. How does it work now?

To slightly exaggerate for dramatic effect, it doesn't. For
every purpose identified earlier, superordinate, regular and special
district requirements, and so on, separate and often many separate
data collection activities (or evaluations or assessments) are
conducted. Each of these costs money, adds one more ounce of general
debilitation to the system, and hardly ever becomes integrated with
the normal demands of running, improving, and satisfying the multiple
clients of the schools. Since the information is rarely used, other
than to rationalize a politically inspired decision (or for real
estate investors to use when marketing homes near "good" schools),
the cost we are incurring is intolerable. Now, as an appositive, we
can develop some cost figures on a per student basis for testing and
evaluation, and on an absolute basis, these costs are not high. What
is worrisome is that these costs take a substantial part of the
marginal funds at our discretion, funds that might go for buying less
outdated books, or adding a teacher here or there. Thus, spending money
for superficial activities required by the political arena annoys the
Calvinist ancestors of my friends, if not my own. Puritanical
yearnings aside, the political requirements for assessment, evaluation,
and other indicators of good management will continue. The trick is
to make them useful.

Thus, as my computer acquaintances are fond of saying, we have a
top-down, bottom-up problem. Accountability looks top-down and
drives the system based on needs for overall views of system
operation, logically, if sometimes not practically, related to policy
making. Bottom-up needs, the classroom in particular, imply
information access and use, but of different kinds, at different points, for
very different purposes. As of now, everyone gives tests and is
involved in the "evaluation process", but it often is mere role
playing. We want to make it real; to make the money spent show up in
high quality educational services and in student performance that we
can be proud of. The problem space (more computerese) of attention is
the juncture or the intersect between top-downness and bottom-upness.
Do we start like the tunnel building children in the sandbox,
burrowing first on one side then another, all the while hoping that their
two hands meet in the unseen darkness? Perhaps a memorable
analogy, but very bad evaluation planning. How do we align,
partition, adapt, and adjust information needs and uses so that we
produce the following?

1. A real information system, rather than the flotsam and
jetsam we call evaluation now.

2. A system that is efficient.

3. A system that manages the reconciliation of policy needs
while maintaining the personality, integrity, and
idiosyncrasy of given schools.

4. An information base that will actually inform instruction.

Methods and Methodologies 

We will start with the unit as the school. First, because of
the good research alluded to earlier that supports the school as the
unit for change, and felicitously, school districts often make policy
at school levels. Next we have to decide what goes in such a system,
and those decisions should be reached based upon what plausible uses
there are now for information. Clearly, there is every justification
for decisions for oversight, for public accountability. And surely,
we want the particularistic time, person, and place bound problems to
be addressed. Since we're creating something new, let's keep our
options flexible while at the same time pursuing a design solution.
Let's agree on basic parameters of the effort and then look from one
side (top-down) and then the other (bottom-up) to see how we're
coming.

Figure 3 

Givens for a Functional School Based Evaluation System 

1. School level a principal focus. 

2. Student performance essential. 

3. Comparability on some elements essential. 

4. Utilization at all levels. 
Let's explore the idea of a comprehensive system with an expansion of
the following features: some of its information allows for
cross-student and school comparisons. And, obviously, there is a technical
basis for comparability of such data. Some elements of the system
are demanded. There are no choices, and those indicators are
identified by policy actors or, to respond to particular data needs, by
superordinate requirements. Let's also posit that the system has
elements in it that allow for local option, quick turn-around,
outcomes measured across time, and multiple data sources on certain
critical dimensions. Let's also include a place for quality of school
life and quality of effort indicators, some measures of instructional
resources and efforts, and measures of process/outcome (depending upon
perspective) including affective measures, indicators of
parent/community support, and measures of collaboration and integration among
members of the school community. Also desirable: measures of societal
changes hitting the school, school-specific indicators on vandalism,
absenteeism, transiency, changes in demography of student or teacher
groups, SES, etc.

Now what makes sense as a task design technique for such a 
comprehensive system? 

1. Required features must be identified, ideally useful at
all levels.

2. Slots for options need to be identified with either
sets of optional indicators for any one slot, or open
choices.

3. Not all slots should or could be filled during any one
school year.

4. Slots should be filled so that longitudinal patterns
might be discerned (multiyear to catch longer term
effects).

5. Information overload needs to be avoided at all levels.

6. Data collection and entry should be easy, and not time
consuming.

7. Principal users (main users) should be participants in
design of system, generation.

8. Procedures for sampling, decomposition and aggregation
should be included so that the least amount of data
necessary is collected.

9. Let's not do the most sophisticated system we can;
let's do the least that will work.
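
One way to picture the "slot" idea in the list above is as a small registry in which some slots are required, others are optional, and each slot accumulates one summary entry per school year. The following is only a minimal sketch under stated assumptions: the names Slot, fill, is_longitudinal and missing_required are hypothetical illustrations, not part of any system described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One indicator slot in a hypothetical school information system."""
    name: str
    required: bool                       # demanded from above vs. local option
    entries: dict = field(default_factory=dict)   # school year -> summary value

    def fill(self, year, value):
        """Record one summary value for one school year."""
        self.entries[year] = value

    def is_longitudinal(self, min_years=2):
        """True once enough years are on file to look for a pattern."""
        return len(self.entries) >= min_years


def missing_required(slots, year):
    """Names of required slots that have no entry for the given year."""
    return [s.name for s in slots if s.required and year not in s.entries]


if __name__ == "__main__":
    slots = [Slot("district reading", required=True),
             Slot("absence rate", required=False)]
    slots[1].fill(1984, 0.07)
    print(missing_required(slots, 1984))
```

Optional slots may stay empty in a given year without being flagged, which reflects the point that not all slots should or could be filled during any one school year.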

Operational fairy tale version 1

Imagine a high school where the following essential data sets
are required: 1) competency-test-based entry scores in
reading and math; 2) competency test scores on district-wide
measures of reading and math. Blake High School also keeps track of
the number of students taking advanced placement examinations in
various fields, SAT scores of 12th graders and post-secondary plans
of seniors. Blake High School teachers think that there is a problem
developing because absence rates seem to be higher. The school
decides it wants to work on this area specifically. In addition, the
school is concerned that it is not challenging its top students and
wants to improve. Last, the school is taking the Carnegie Report
seriously (Boyer, 1983) and wants to assure that its students are
competent writers.

How would our fantasy system work in that environment? Let's
just focus on two of the assorted problems. Absence rates need
attention. The system collects and sorts not only how many absences
occur, the rate, but also the distributions, what kinds of students
are absent, over what broad distribution of time, and over what
particular days. Proximity of school events (football, dances), drug
referrals, transiency and school demographics are plotted. It is no
problem to summarize these records for the district to consider as
models for analysis. Obviously, patterns are reviewed, and if,
hypothetically, a clear pattern develops, for instance, that absences
are distributed disproportionately among new students who avoid school activity
days, something can be done.
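
The kind of sorting described for absences can be sketched in a few lines. This is an illustration only, assuming a hypothetical Absence record with two invented flags (new_student, activity_day) and a hypothetical enrollment count for each group; a real school would choose its own categories.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Absence:
    student_code: str     # anonymized code, not a name
    day: date
    new_student: bool     # hypothetical flag: first year at this school
    activity_day: bool    # hypothetical flag: school event on that day


def absence_rate_by_group(absences, enrollment):
    """Absences per enrolled student, keyed by (new_student, activity_day).

    enrollment maps new_student (True/False) to the number of students
    enrolled in that group.
    """
    counts = Counter((a.new_student, a.activity_day) for a in absences)
    return {key: counts[key] / enrollment[key[0]] for key in counts}


def absences_by_weekday(absences):
    """Count absences by weekday number (Monday is 0, Friday is 4)."""
    return Counter(a.day.weekday() for a in absences)


if __name__ == "__main__":
    records = [
        Absence("s01", date(1984, 10, 5), True, True),
        Absence("s02", date(1984, 10, 5), True, True),
        Absence("s03", date(1984, 10, 8), False, False),
        Absence("s01", date(1984, 10, 12), True, True),
    ]
    print(absence_rate_by_group(records, {True: 2, False: 10}))
    print(absences_by_weekday(records))
```

A rate much higher for one group than another is only a prompt for review by the staff, not a conclusion in itself.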

Second problem, please recall, is improving writing. Assume the
English Department of Blake High School manages to convince the rest
of the faculty that writing is something that needs work.
Minimally entered into the system could be the number of writing
assignments received for any given student across classes, e.g., in
science, with appropriate description (average length, type of
assignment). In addition, imagine that the English teachers have
heroically taught a common scoring system (analytic, of course) to
the teachers of other subjects. Thus, data for kids, entered on the
micro by "computer" kids, includes student code number (for privacy),
any scores on task competence, organization, concreteness,
orthographic conventions, syntax, usage, etc. In addition, lists of
topics, resources, and assignments could be kept. At Blake High,
"slots" in use involve across-time tracking of absences, with some
global SES correlates, as well as across time, course, task, skill
reporting of writing performance. Based on the baseline, and full
entries on a 30% sample, teachers can see that students are having
difficulty with task directions, that is, knowing what they should
write about, rather than simply problems of expression. Teachers
decide some explicit prewriting activities ought to be tried.
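
To make the 30% sample and the analytic scores concrete, here is a minimal sketch, assuming a hypothetical set of six scoring dimensions and a hypothetical layout in which each student code maps to one score per dimension. The function names and the fixed random seed are illustrative choices, not a description of any actual scoring system.

```python
import random
from statistics import mean

# Hypothetical analytic dimensions, loosely following those named in the text.
DIMENSIONS = ("task", "organization", "concreteness",
              "conventions", "syntax", "usage")


def draw_sample(student_codes, fraction=0.30, seed=0):
    """Pick a reproducible random sample of student codes for full entry."""
    rng = random.Random(seed)
    k = max(1, round(len(student_codes) * fraction))
    return rng.sample(sorted(student_codes), k)


def dimension_means(scores, sample):
    """Average score on each dimension over the sampled students.

    scores maps student_code -> {dimension: score}.
    """
    return {d: mean(scores[s][d] for s in sample) for d in DIMENSIONS}


def weakest_dimension(means):
    """The dimension with the lowest average, a candidate for attention."""
    return min(means, key=means.get)


if __name__ == "__main__":
    codes = [f"s{i:02d}" for i in range(20)]
    scores = {c: {d: 4 for d in DIMENSIONS} for c in codes}
    for c in codes:
        scores[c]["task"] = 2   # invented data: trouble with task directions
    sample = draw_sample(codes)
    print(len(sample), weakest_dimension(dimension_means(scores, sample)))
```

Scoring only the sample in full keeps the data entry burden low while still showing which dimension is dragging performance down.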

What minimal design modules should such a system have? 

1. A KIDFILE, including identification of the student,
pertinent demographics, existing essential comparable
scores on standard tests. The Kidfiles should be
aggregable by grade, SES, absence rate, performance
indicators, academic grades, course of study, years in
the district, etc.

2. A DATA ENTRY SYSTEM, probably student-user dependent.

3. A MICRO with a hard disk.

4. Some PERSON, probably a teacher given one period
release, to be in charge and to take the lead in
interpretation. It is best if this person is not a
math teacher; maybe a union leader, someone with good
personal skills, and an excellent teacher.

5. A MECHANISM for decisions to be made on what aspect of
instruction or program people want to move on, and for
which they have plausible options. People may choose
to focus where they suspect problems. Both mechanisms
need to be tied explicitly to data with some identified
milestones (time to look, sort, and interpret).

6. A MECHANISM to delete things out of slots and switch
effort to other areas.

7. A METHOD FOR REPORTING good works, either good effects,
or interesting processes up the line to get credit from
district.
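
The Kidfile and its aggregation can be sketched as a simple record plus one grouping function. This is a sketch under assumptions: the field names (ses_band, years_in_district, test_scores) and the aggregate helper are hypothetical, chosen only to mirror the attributes listed in module 1.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Kidfile:
    """One student's record in a hypothetical school information system."""
    student_code: str                 # code number rather than name, for privacy
    grade: int
    ses_band: str                     # hypothetical coarse SES category
    years_in_district: int
    absence_rate: float
    test_scores: dict = field(default_factory=dict)   # e.g. {"reading": 61}


def aggregate(kidfiles, key, measure):
    """Average one test measure within groups defined by key(kidfile)."""
    groups = defaultdict(list)
    for k in kidfiles:
        if measure in k.test_scores:      # skip students with no score on file
            groups[key(k)].append(k.test_scores[measure])
    return {g: mean(v) for g, v in groups.items()}


if __name__ == "__main__":
    files = [
        Kidfile("s01", 9, "low", 1, 0.10, {"reading": 60}),
        Kidfile("s02", 9, "mid", 4, 0.02, {"reading": 70}),
        Kidfile("s03", 10, "mid", 6, 0.05, {"reading": 80}),
    ]
    print(aggregate(files, lambda k: k.grade, "reading"))
```

Passing a different key, for example lambda k: k.ses_band, regroups the same records without any new data collection, which is the sense in which the Kidfiles are meant to be aggregable.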

Obviously, for this system to work, the larger organizational units
will need to be responsive and supportive. A school district might
have to explore how it can reduce the information burden it places on
individual schools during the period of early implementation of the
system.
The district must:

• Provide incentives, rewards, credit, for such activities.

• Minimize its tendencies to use information for personnel
decisions (move a principal based on data he/she generated).

• Protect privacy of school, staff and kids.

• Try not to add to essential list, without deleting something
else. Provide a period of safety and protection for system
trial (pilot) and distribution.

• Monitor and support.

What are the research issues inherent in such a system? Clearly,
there is enough work to supply any individual's entire scholarly
career. Let's take the idea of measurement constructs and
comparability as research issues and explore them in terms of the writing at
Blake High School. From research, we know that writing performance
varies enormously with the task selected and the particular topic
about which the student has to write. Task differences include the
different purposes of writing, often categorized by types of
discourse, such as persuasion, exposition, narration, and so on,
although even these categories have blurred boundaries. Task also
varies in terms of the audience to whom the writing is addressed.
Writing performance also reflects general language facility, command
of orthographic conventions, like punctuation and spelling, range and
fluency of syntactic options, and individual differences in
intelligence, experience, and other trait-like variables. Given that
the orientation to a school level system implies that between-school
differences are large, and differences among children are also
large, how could one develop a
writing assessment that is fundamentally valid for the experience,
setting and instruction of individual children, and at the same time
can provide a fair and comparable measure for groups of schools? Do
we need to provide opportunity to write on the same topic across time
periods, for longitudinal information? Well, of course, but what
about practice effects? Do we need to use the same topics across
grade levels (to look at growth)? Do we invoke the same scoring
standards for students at different grade levels, even if they share
the same task and topic assignments? How do we report cross-school
comparisons when students at different schools can handle vastly
different levels of task? How can we go about reporting on writing
progress overall, without resorting to a general measure that is
appropriate to no group assessed? Clearly, the top-down, bottom-up
system is not an off-the-shelf item. It is, however, one technology
that, with its underlying theoretical and practical research issues,
may be worth our time. The goal may not be to build this system,
but to use the design problem as a way to shed new and perhaps
creative light on a dark space.
Beyond Outcome Measures: An Agenda for School Improvement

John Goodlad

I 

Let me begin by talking a little bit about the circumstances in which
we now find ourselves in the current furor over the reform of schooling in
the United States. I think it does have to be placed in some perspective
if we are going to respond appropriately. A good many analysts have
pointed out that the decline in competence in schooling, as well as the
increase in disaffection in schooling that occurred during the decade of
the 70's, is very closely linked with declining faith generally in our
institutions and with the decline of the economy that began during that
same period.

I don't think it's any surprise that the release of the report, A
Nation at Risk, last year, had a comparable effect to the launching of
Sputnik in 1957. We had been building up for it. If the Nation at Risk
report had not focused our attention on schooling there would have been
some other catalyst. The response was very similar to the response
following Sputnik: that is, an immediate outcry regarding the quality of
our schools, "the rising tide of mediocrity in our schools. If some other
nation had imposed the condition of our schools on us it would have been
comparable to an act of war." The report goes into a series of very
specific recommendations regarding a longer school day, more math, more science,
more technology, more discipline, better teachers, and a certain amount of
pie in the sky, along with a lot of other rather quick remedies.

Very soon, there was the usual galvanic connecting of achievement test
scores with school health. That is, there is a rising tide of mediocrity
in the schools and the presumed indicator, in large measure, is declining
achievement test scores. Therefore, the indicator of improving school
health will be a corresponding increase in achievement test scores.

I would like to submit that achievement test scores constitute a poor
thermometer for judging the health of schools, just as the thermometer we
use with human beings is a poor one for judging the health of human
beings. Notice the response when a person's temperature rises and we get a
reading showing 103 or 104 or 105: there is the immediate use of an
antibiotic. Yet in the most serious illnesses, the closing up of the
arteries or the beginning of a cancerous condition, the thermometer would
tell us nothing. And you will also note that with a serious heart
condition or the building up of problems with the arteries, there is always
a long term cure, a long term preventative, a long term correction of the
condition. I would like to submit that if the schools are indeed in the
condition of health that many reports are saying they are in, then it is
going to require a long period of care and attention to put the schools
into the health that we would aspire to during coming decades.

Because of this galvanic connecting of achievement test scores and the
health of schools, we turn rather immediately to remedies which turn out
not to address the health of schools. That is, they do not address the
quality of educating in schools. And if the thermometers we use do not
turn our attention to the quality of educating in schools, then the schools
are not likely to get profoundly better, even if achievement test scores go
up. And there is no question in my mind that achievement test scores in
coming years will go up. They will go up particularly in the most
mechanistic aspects of learning. And because of some of the reforms we are
beginning to think about, test scores will go up in some of the less
mechanistic aspects of learning.

But I'm not at all sure that the quality of educating in schools will
correspond to the rise in achievement test scores any more than the quality
of education could be said to have paralleled the decline of achievement
test scores, about which we were concerned in the beginning of all of
this.

I don't think it entirely facetious for me to say that when the
reports of 165 additional commissions are in, we already will have seen
some of the signs of improvement. And I'm not at all sure that the
implementation of the recommendations in those reports will make a very
significant difference to the degree to which test scores are going to rise.

I made reference to 165 commissions; that's the last report I've
had. I've had to revise this number almost every time I've spoken on this
issue. These are not casual bodies at work; they are state-level
commissions. Most of their deliberations will lead to legislation which will
be introduced in the sessions of the state legislatures this coming fall.
However, we need to be aware that there are conditions having to do with
the economy, having to do with the success of other institutions, and
having to do with how we feel about ourselves that become immediately
reflected in the schools. This does, indeed, cause us to turn to the
schools in concern. We've not yet been very successful as social
scientists in interpreting the reasons for the earlier decline in test scores.
I doubt that we will be very successful in interpreting the increase in
test scores in the years to come. It's part of the press around us.

Everywhere I go in the country, teachers are working harder.
Students are working hard. Some students in high schools are thinking
about the law school they're going to attend or the post graduate work they
are going to do after they complete their baccalaureate. There has been
that kind of change. I'm not at all sure that it's more of an orientation
of coming to grips with knowledge, but it's certainly an orientation of
coming to grips with one's financial future.

As the test scores go up in the years to come, the rhetoric of
self-congratulation on the part of those who are now making the
recommendations will increase. That is, we will begin to adjust the
rhetoric to the test scores and then say that what we're doing at the
present time is improving our schools. And I'm raising some questions
about such a connection with test scores.

Part of what is needed for a significant improvement to occur are
comprehensive diagnoses of the educational enterprise and the educational
condition. Yet in spite of all of the reports about schooling, there are
still relatively few diagnoses. I want to present a perspective on these
diagnoses and to deal with some specifics regarding their nature.

Let me turn first to the assessment of state responsibility. What
should the state be doing at the present time? First of all, I think,
states should articulate, much more clearly than they are currently doing,
the comprehensiveness of our expectations for schools. And I don't think
this is a capricious matter. We have data on these expectations. For
example, in our Study of Schooling we looked at the expectations for
schooling from a historical perspective. We analyzed the documents of all
50 states, we administered questionnaires to 8,600 parents, to 17,000-plus
students, and to the teachers and principals in our sample. And what comes
out of these data is that our society isn't backing off from a
comprehensive set of expectations for schools; society is still concerned
about academic development, citizenship development, vocational
development, and personal development.

Further, though James Coleman has been saying in some of his recent
addresses that we can no longer think of the school in its surrogate
parenting role, my conclusions are precisely the opposite. With
demographics changing as profoundly as they are (in regard to the support
of the home and the support of the religious institutions so far as they
affect the young) we are expecting more of a surrogate parenting role of
the schools than perhaps we did before. Those three institutions
(school, home, religion) joined very closely when I was going to school.
Now, more and more, we have deep concern about the school, in part because
of the decline of the other institutions. It's interesting, for example,
to note the number of parents in our sample who would opt for prayers in
the school. And I'm not at all sure that this is only some kind of
far-right religious concern representing a major turn in our society. I
think it grows out of frustration on the part of parents (particularly with
their young people entering puberty and adolescence) who, not knowing what
to do about their children and hoping the school can do something, suddenly
realize that teachers are human too. Therefore, it might not be a bad idea
to have God in the classroom as well as the teacher, and so prayers become
a pretty good idea. That may be an overstressed set of relationships, but



It's interesting that only 37 of the 50 states articulate in any
reasonably clear way the four areas of historic expectations (academic,
citizenship, vocational, and personal development) that have emerged so
clearly. It's interesting that California is one of the states that does
not articulate these expectations, but rather states goals in the context
of the subject fields: education is teaching math, teaching science,
teaching reading, teaching literature, rather than the using of those
fields of knowledge for some higher human purpose (in addition to the
purpose of learning the subject fields).

So, a state should be held accountable for the clear articulation of
the expectations which careful surveys show are there. In addition,
however, the state has a responsibility to define what the so-called common
school means today. The common school was a vehicle in our society and
part of its characteristics were designed to ready students for entry into
the labor force. And until the early part of the century, the elementary
school was the agreed-upon level of entry into the labor force and, as
such, constituted the common school. Today one is expected, for entry into
the labor force, to have matriculated from high school and have a
high-school-leaving certificate. That means, then, that we should be
evaluating the success of schooling not merely by the degree to which
pushout programs (disguised as preparing the young for jobs, many of which
are no longer there by the time they are to leave high school) increase
your achievement test scores. Unfortunately, if you increase your
achievement test scores while your retention rate is declining, your school
gets brownie points.

But how about the criterion that the successful educational
institution, K-12, is shaped like a rectangle? And that the most
successful school is the one that keeps all those angles at 90 degrees?
This means that those who begin in the kindergarten graduate with a high
school certificate. However, the responsibility of schooling is not just
to keep those young people in, but to assure comprehensive, democratic
access to the domains of knowledge that constitute a good general
education. What a different criterion that would be. What a change that
would bring about in regard to almost everything we do in schooling.
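
The contrast between the two criteria can be made concrete with a small sketch. The cohort below is invented purely for illustration: ten kindergarten entrants with made-up scores and a hypothetical cutoff of 60 below which students are pushed out. None of these figures come from the Study of Schooling. The sketch shows how a school that loses its lowest scorers can post a higher average score while its "rectangle" narrows.

```python
# Hypothetical cohort of ten kindergarten entrants and their later test scores.
# All numbers are invented for illustration only.
cohort_scores = [35, 40, 45, 60, 65, 70, 75, 80, 85, 90]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def retention_rate(entrants, graduates):
    """Share of the entering cohort that reaches graduation (1.0 = a 'rectangle')."""
    return graduates / entrants


# School A keeps every entrant through graduation.
school_a = cohort_scores

# School B loses every student scoring below a hypothetical cutoff of 60.
school_b = [score for score in cohort_scores if score >= 60]

mean_a = mean(school_a)
mean_b = mean(school_b)
retention_a = retention_rate(len(cohort_scores), len(school_a))
retention_b = retention_rate(len(cohort_scores), len(school_b))

print(f"School A: mean score {mean_a}, retention {retention_a:.0%}")
print(f"School B: mean score {mean_b}, retention {retention_b:.0%}")
```

Judged by average score alone, School B looks better; judged by the rectangle criterion, it has lost three of every ten entrants.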

First of all, having a good school, as defined here, would require an
enormous amount of collaboration among teachers and students. Students
would have two responsibilities: one, to learn; the other, to help
everyone else to learn. The best school would be the one that retains 100
percent of its young within a comprehensive program that we can agree to.
And that means equity, equity with respect to knowledge. But when we
look at our data on tracking, it shows very clearly the disproportionate
number of poor children in the low tracks and, in turn, the
disproportionate number of minorities among those children. And when one
looks further one notices the lack of equity in regard to the content in
the upper and lower tracks. One also sees the lack of equity in regard to
the pedagogical methods being used in the lower tracks versus the
pedagogical methods being used in the higher tracks. And when one looks at
teachers' expectations for the higher tracks, which are clearly higher,
clearly different, than teachers' expectations for the lower tracks, we
find a monstrous situation of inequity, not the equity we wish to see.

The civil rights movement, once it resurfaces in this land with
respect to education, will not be fought over access to schools. It will
be fought over access to knowledge. And we will have to examine with great
care those practices in schools which we take for granted, but which
clearly operate against the principle of equitable access to knowledge for
all within a comprehensive twelve or thirteen-year program leading to a
high school certificate.

We all know the skillful ways in which we can subvert rewards for
individual schools because of their gains in achievement test scores. For
example, you can manipulate scores, either by leaving out groups of
youngsters in the tests, or by the way you monitor those tests, or
whatever. We need to pay attention to the work that Peterson is doing at
McGill University right now, where he has begun to document the progression
of youngsters through their educational experience. He's going to spend
twelve or thirteen years of his life at this, documenting youngsters year
after year longitudinally. And in talking with him just recently, he
mentioned something he found just legion: the degree to which teachers
provide subtle clues in walking around the room and watching the response
of a youngster to a test. They say "Hmmm," and the child quickly looks
again and changes the answer. We have all kinds of skillful techniques
when the goal in mind is raising achievement test scores on the basis of
those who are retained (particularly in these upper grade levels)
rather than the extent to which children do well in a comprehensive
curriculum and actually stay there until graduation.

Well, I spent more time on that than I intended to, but I want to give
the notion now of how a different kind of quality indicator could be used
by the state. I have great questions about the state's concern with
individual children, and think the unit of selection for evaluation ought
to be at a much higher level than that, such as the nature of the total
program being offered. State responsibility should represent commitment to
that broad set of expectations, commitment to the scope and breadth of the
program and equity to which I just referred, and commitment to an
evaluative framework commensurate with these expectations. And, in
addition, the state must be committed to the development of quality
expectations in regard to the curriculum, its completion by all students,
and the degree to which knowledge is humanized within that program for
equal access of students. I'll come back to that point when I deal with
the classroom or the school as the unit of analysis.

Let me turn now to the institution, to institution-based or
school-based assessment. Let's assume that our concern, at least in our
rhetoric, is with the quality of educating in schools and with the health
of schooling. Let's also assume that achievement test scores were never
intended to measure the health of schools (some of you may have read,
recently, the articles in the Los Angeles Times by David Savage and the
quote from Gregory Anrig, President of Educational Testing Service, who
says that the SATs were never intended to appraise the quality of educating
or the quality of schools). Let me begin talking about institutions by
referring to Sara Lightfoot's work.

Sara Lightfoot has published a book called Good High Schools. It
consists of portraitures of six schools: two private, two more or less
upper socioeconomic class, and two urban high schools. She introduces a
concept of "goodness." It's interesting to note that her concept of
goodness deals so much with the degree to which the environment of the
school supports the quality of living in those schools first, and the
quality of educating in those schools after the quality of living has been
raised to a point where instruction and learning can proceed. The schools
are profoundly different and it's interesting that the Carver School in
urban Atlanta is one of her "good schools." She talks about "good enough";
she doesn't say excellent, but good enough to be capturing the attention of
far more students than was previously the case. And she describes a lot of
things about that school that would make us wonder, on the basis of some
criteria, how in the world that could be a "good school" in her judgment.
Then you begin to see Carver in some historical perspective: the lack of
attention to the life of the school before the coming of a particular
principal and a supportive superintendent, the conditions in the school
that operated against learning, and the progress that had been made during
recent years.

Schools have profoundly different cultures. There is no way to
prescribe details in common for them. Indeed, in John F. Kennedy High
School, another public school which Sara Lightfoot described in New York
City, to prescribe in such a way as to seek to increase the intensity of
academic life would simply be to increase those things in the culture which
can be seen to be detrimental.

I urge you to read Lightfoot's book. It is a sensitive interpretation
of life in six high schools. It is also interesting, for those of us who
are interested in careful methodology, to read her commentary on
educational research. She has some rather rough things to say about what we've
been doing in the past, and admittedly she's defending work which she knows
is going to be highly criticized in some quarters. Yet, it has not stopped
her from moving in eight years from an assistant to a full professorship at
Harvard, and she is now being sought after by several of the major
institutions in the country.

We're beginning to get a different kind of handle on what is important
in schooling and Lightfoot helps us a great deal. As a kind of a side
comment, I'd like to note that what Sara Lightfoot is talking about (after
detailed descriptions of her six schools), or what Ted Sizer is talking
about in Horace's Compromise (his analysis of teaching in schools and the
compromise that Horace had to make), is miles and legions away from where
many of the commission reports are landing with respect to improvement.

Let me turn more specifically to what we might look at if we were
concerned about the health, the condition of a school. My colleagues,
Leigh Burstein and Kenneth Sirotnik, have been giving considerable
attention to contextual analysis of schools, as have other colleagues at the
Center for the Study of Evaluation at UCLA, and I think this kind of work
is going to be very seminal. Leigh and Ken have done a lot of significant
work and some preliminary publications are available; it is well worth
considering.

What they're talking about is getting into the context of schools,
the conditions within schools. And when one looks at the conditions within
schools, they take on meaning only as one relates them to a value system.
And of course, that recognizes the fact that the understanding of schooling
is in part a science, in part an art. Because when it comes to the
improvement of schooling, ultimately, we do that only through the
application of norms, the application of values, the application of beliefs. But
it also helps a great deal to take a look at present conditions.

For example, the degree to which a school has disruptive problems, the
degree to which a school is torn apart by problems, can make it almost
ridiculous to mount a staff development program based upon, say, specific
cueing techniques for the improvement of instruction. When I visited one
of the schools in our sample of high schools, getting the contextual sense
to add to the hard data, I couldn't get to see the principal with whom I
had an appointment at nine o'clock until eleven o'clock. He was on the
telephone dealing with the courts all morning. In the meantime, I walked
through the school building with the vice principal. This person was Vice
Principal for Curriculum and Instruction and I asked him, "How do you
spend your time?" He said, "Doing what I do now," as he reprimanded and
separated students fighting in the hallway. He said, "My great frustration
is that I came here because I was going to be Director of Curriculum and
Instruction. Now I know why I came." And I looked at him and I knew why
they had sent him. He was six-foot six, two hundred and thirty pounds, and
an imposing figure as he walked through the hallway with another Vice
Principal for Discipline who was about the same size. As they went through
the hall, almost his entire time was spent in cleaning up fights. The
major one that morning was within a group that had come in from the outside
fighting with the students in the hallway. As we went around to the
classes, they didn't bother to separate the children in the industrial arts
classes, for example, to go into instruction with other children. They
simply moved from working in the shop into algebra and mathematics and
whatever, and the environment hardly changed. The conversation went on,
the disruption went on, and one had to say that those children and those
students were in no danger of learning anything that the school was trying
to teach.

Sara Lightfoot points out as well that these are the problems that
have to be addressed first. And so, in getting an assessment, in
evaluating, if you will, the quality of life (what is the condition of the
school environment), we discovered in our study of schooling that the range
of serious problems ran from none to a couple of problems, rated by teachers,
parents, and students as only mildly important or difficult, all the way
to a school that had twenty-five problems which teachers, students, and
parents rated as very serious. Where do you begin improvement in that
latter kind of school? Do you say, "We're going to have a staff development
program to improve instructional methods," when the teachers aren't even
conversing or communicating with rowdy, unruly students? Of course not.
You begin where the culture of the school is. What are some practical
problems you look at? How about time use, for starters, now that we're
getting so much research on the importance of time in schools? We
discovered in elementary schools with roughly the same length of school
day, some children being in danger of learning what the school was trying
to do for only eighteen and a half hours a week, and at another school
children having 50 percent more instructional time, or twenty-seven and a
half hours a week. I, with some of my colleagues, was one of the early
people to speak with the National Commission on Excellence in their
hearings. At their first morning of hearings, during the fifteen or
twenty minutes I had for testimony, I said, "I hope that one of the things
you will not do is recommend increasing the length of the school day."
Well, so much for expert testimony.

My reason for that was the climate of the school with eighteen and a
half hours a week, an obviously careless one with respect to the use of
time (slow getting started, tardy children getting tardier while they
waited to see the principal, recess stretching from fifteen minutes to
thirty minutes, lunch hour dragging through much longer than was intended,
and good old clean up time). There are enormous differences in the use of
time, and these are clearly cultural problems in the school environment.
These problems need to be addressed by the parents, the teachers, the
students, under the leadership of the principal, in order to get enough
time to have a comprehensive curriculum. In analyzing our data, we
concluded that in our sample of elementary schools, children were in
instruction during the week for an average of twenty-two and a half hours.
I looked at that time figure and laid it up against a model of access to
the domains of knowledge in the elementary school and concluded that it was
not enough. I didn't recommend a longer school day. I recommended that
the local school work on that problem because they had enough time if they
didn't spill so much of it. With twenty-five hours a week, for example,
you've got ninety minutes a day of reading/language arts, an hour a day of
math, fifty-five minutes a day of social studies, fifty-five minutes a day
of science, three art periods a week, and health and physical education
every day. With only eighteen and a half hours a week, you've got ninety
minutes a day of reading/language arts, an hour a day of math, twenty-three
minutes a day of social studies, thirteen of science, no art, and not much
physical education or health. With twenty-seven and a half hours, you've got
the curriculum I just recommended and a lot more.
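
The arithmetic behind these schedules can be checked with a short sketch. It assumes a five-day school week, which the talk does not state, and uses only the daily minutes quoted above; the function and variable names are illustrative.

```python
DAYS_PER_WEEK = 5  # assumption: a five-day school week


def minutes_per_day(weekly_hours):
    """Convert weekly instructional hours to average minutes per day."""
    return weekly_hours * 60 / DAYS_PER_WEEK


# Daily minutes quoted in the talk for each schedule.
schedule_25_hours = {"reading_language_arts": 90, "math": 60,
                     "social_studies": 55, "science": 55}
schedule_18_5_hours = {"reading_language_arts": 90, "math": 60,
                       "social_studies": 23, "science": 13}


def content_minutes(schedule):
    """Daily minutes given to social studies and science combined."""
    return schedule["social_studies"] + schedule["science"]


day_long = minutes_per_day(25.0)      # 300.0 minutes a day
day_short = minutes_per_day(18.5)     # 222.0 minutes a day
lost_per_day = day_long - day_short   # 78.0 minutes a day

# Minutes a day lost from social studies and science alone: 110 - 36 = 74.
lost_from_content = (content_minutes(schedule_25_hours)
                     - content_minutes(schedule_18_5_hours))

# Extra instructional time at the 27.5-hour school relative to the 18.5-hour one.
extra_share = (27.5 - 18.5) / 18.5

print(day_long, day_short, lost_per_day, lost_from_content, round(extra_share, 2))
```

Because reading/language arts and math hold at 150 minutes a day in both schedules, nearly the whole daily shortfall (74 of 78 minutes) comes out of social studies and science, and the gap between the two schools works out to roughly the 50 percent cited.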

How about school climate? Do we not have climate indicators that we
could use to determine, for example, what is valued most in the school
culture? Friendships? Athletics? Smart students? Classes? Teachers?
Or drugs? Alcohol? Games? Sports? Etc. As you know, the Select
Committee on Education in Texas has been tackling this with a meat ax.
They have concluded that there will be no athletics during the week (this
is a recommendation coming from the legislature this June): no athletics
whatsoever on any weeknight, and no school-sponsored activities after six
o'clock in the afternoon. They have prescribed a whole array of things
because they're concerned that there is so much that is not close to the
learning process. Some of this bothers me a good deal, because I think it
is possible, using time well, to have a comprehensive curriculum wherein
students who are in vocational education programs may be getting the
satisfaction and stimulation they need to perform in some of the other areas.
Moreover, vocational education programs may be the entry into mathematics
and science and the like for students who are getting turned off. Notice
that I used the word "education," however, and did not use the word
"training." I'm talking about the kind of thing that John Dewey was doing
with woodworking in his laboratory school at the University of Chicago.

I want to go on in this assessment. How do we get at the
principal-teacher relationship? And then, from the research on effective schools
and from elsewhere, what might be that most effective kind of relationship?
It's very interesting that when we divided our schools up into the most
satisfying quartile and the least satisfying quartile (using an index of
satisfaction based on data from teachers, parents, and students), every
single elementary school principal in the least satisfying quartile said
the teachers are part of the problem. In the most satisfying quartile, I
believe only one principal said that the teachers are part of the problem.
I don't think that these were profoundly different people. And
incidentally, when we looked at the correlations among satisfaction, school
climate, class climate, principal-teacher relations, school-community
relations, and the like, it was very clear that the most satisfying schools had
a bond of trust and support and a working relationship between principals
and teachers that was quite different from those in the bottom quartile.

Having assessed these things in the environment of the school, one
still does not have a program of improvement. But now one can bring to
bear the value system of the professionals in the school, as well as
interested citizens who are brought into what Bruce Joyce calls "the body
of responsible parties." They can then begin to engage in long-term
planning by saying: What is our first agenda item, second agenda item,
third agenda item? And that becomes the agenda of improvement for the
local school, approved by the superintendent and the board, and supported
by them. This would result in such a different environment for school
improvement than what is usually the case.

During the last fifteen months or so, I've had an opportunity to take
another look at Edmonton, Alberta where, nine years ago, the superintendent
of schools and the board introduced what I'm talking about: a planning
process with "every tub with its own bottom." Responsible parties at the
level of the local school engaged in assessing their needs (in a primitive
fashion, admittedly, because we don't yet have the technology) and came up
with priorities. They were able to sit down, in a non-confrontational
situation with the superintendent and the board, to review what it was that
they were about and what they wanted to do. And they went about getting
the endorsement of the superintendent and the board, getting differential
support, getting funds for what they wanted to do, and then going about the
business of doing it and reporting their progress the following year.

When I was back there a year ago, Edmonton had just been through a
severe budget cut comparable to the budget cuts that have occurred in some
districts around us. I expected to walk into a terrible morale situation
(teachers upset, principals upset), a real downer. However, I walked
into a very positive situation because here the superintendent and the
board had called in all the principals and said, "We have to do a budget
cut of so much percent. Go back and revise your plans and see what you can
do about cutting." All those principals came back several months later.
They'd revised their plans; not only had they effected the budget cut, they
now had a surplus. And then they asked the question, "May we keep it?"
How foolish the superintendent and the board would have been had they not
so permitted.

How different this is from a board of education obsessed with its
importance, tearing its hair at one-thirty in the morning, reporting to
workers how tired they are because they were fulfilling their
responsibilities to the local community the night before, and slashing whole
chunks out of the school program to nobody's satisfaction. In contrast, the
smooth and morale-building process that occurred in Edmonton permitted, lo
and behold, good morale while effecting a budget cut! This, I think, is about
the ultimate in concept. They were a long way, however, from being able to
do this in a precise way, because we don't have the instruments, we don't
have the technology, and we won't get them until we're concerned about such
assessment.

Let me conclude with some brief comments on instructional assessment.
Every bit as important, perhaps more important than whether or not a
teacher produces attainment on an achievement test score, is the matter of
whether or not a teacher in the classroom provides the students with an
array of learning experience commensurate with our expectations for
schooling. Do children ever engage in solving real problems? Do they ever have
to work on a problem where there is no reward until every member of the
group has done his or her part, with or without the assistance of others,
and the entire task is done by the group? After all, building that kind of
collaboration is the way we work in many aspects of life. In spite of the
fact that our expectations for schooling talk about learning cooperative
behavior, what we find in most schools is anything but that. We find from
the beginning that learning in school has been learning alone in groups.
To what degree do students do anything that requires some kind of response,
some kind of product that's not preordained by the textbook or the
workbook? To what degree do youngsters engage in modes of inquiry commensurate
with what we think learning is? I'm not going to pursue this any further
because one of the speakers will be doing that today, I'm sure, but I want
to touch just briefly on the notion that there are more things to
evaluating the effectiveness of the teacher than the product of achievement test
scores.

What about class climate? Does class climate reflect what we know
regarding human learning? We know a great deal, and clearly we won't
reflect all of it. But is there some reflection there of what we know? One
of the things we discovered in our studies is that there is very little
variation used by teachers in the mechanics of teaching. The technology
doesn't differ much from class to class to class. It gets to be terribly
dull and boring, as Kenneth Sirotnik has pointed out in his paper recently
in the Harvard Educational Review. But we did find that the climate
surrounding this pedagogy differed quite markedly in the classroom. And,
consequently, there were classes that had more guidance, with the
feedback that is one element of good teaching, as many researchers propose.

And then, finally, in this area, what about the assessment of the
students' own experiences with school? What about those declining academic
self-concepts, where many students by the fourth grade are clearly saying,
"I'm not doing very well in school. I don't do well in mathematics and I
don't feel very good about that." And then the need to recognize the
change from focus on the school and focus on the subject that some of the
tenth grade students in our sample indicated by saying, "Sometimes I don't
feel good about myself at all." Is this the product we want of schooling?
Isn't it interesting that we couldn't ferret out many differences in
attitude towards school itself between those who were adjusting well and
those who weren't? But we could identify the feeling of turning on
oneself. What a marvellous job we've done of placing this institution in a
high level of significance so that the individual says, "My failure is due
to myself, and I don't feel good about myself at all."

What about students' academic self-concepts as they move through
different schools? What about the curricula that students actually
experience, not the curriculum that's offered? What about criterion- and
domain-referenced measures that will tell us the growth that students have
made in writing a paragraph from the time they're nine until the time
they're twelve? Or what about our concern with the fact that students' art
products seem so imaginative and creative at the age of five and six and
seven, and then seem to get so stereotyped as they move on upward? What
about taking a look at those developmental kinds of things?

And clearly, if we're going to get a handle on schools and their
improvement, if we're going to have schools and educational systems that
are healthy a decade from now, we're going to have to take a longitudinal
view. We're going to need entry measures. Where's the school now in its
health? Where's the state now in its articulation of goals? Where is the
state now in the degree to which it is encouraging the development of
assessment instruments that get at all the goals of schooling and not just
the mechanics? And then, what is the progress, whatever the criterion,
that students have made over a period of time? Again, I refer to Sara
Lightfoot (because she has made such a profound impression on me) and note
the extent to which she assessed a school not in the light of
now/cross-sectional/immediate measures but in the light of the history of
that school: what was it doing now to become a better place for learning
than it had been the year before?

Using Educational Evaluation for the Improvement of
California Schools
Elliot Eisner

I would like to start out by clarifying what I think evaluation means
in the context of education. I think the idea of educational evaluation
often gets confounded with a host of other concepts that really obfuscate
its meaning and confuse both professors of education and practitioners. We
tend to mix up the notion of evaluation with the notion of measurement; we
tend to confuse testing with measurement and evaluation. What I would like
to do in this presentation is to sort out these concepts, because I don't
regard them as being identical at all.

Measurement is a way of quantifying information according to some
convention, some standard; it does not make a judgment about quality. If
I say, for example, that this room is larger than that hallway to which it
is adjacent, I am making a descriptive claim that talks about quantity, and
that descriptive claim is based upon my estimation, my appraisal, my
judgment of space. But in no way is it a measurement of the space that is
out there and the space that is in here. For me to measure this room means
that I have to employ some kind of device with conventional indices that
represent the space that I occupy here, that this room represents, and that
a hallway represents. Measurement is a way of quantifying information; it
is a way of quantifying information according to some conventionally
defined metric. A meter is a bar of metal kept in Paris and it defines
what amount of length a meter is. It is arbitrary. We could define it in
many other ways.

It is possible to measure things without evaluating them. I could
measure the length of this room, the width of this room, and the cubic
space in this room without making a value judgment about whether this is
good or bad, or indifferent, or appropriate or inappropriate. I could make
a measurement of this room to determine how much carpeting I need in the
room. This is a description of a state of affairs; it is not an
evaluation. I can stand on a scale in the morning and I can measure my
weight, and if I say, "Oh, oh," then I am evaluating. But if I simply want
to know my weight, I am using that measurement in order to get information.

Evaluation has to do with making value judgments, value judgments
about something that we care about. In education we care about educational
processes and the consequences of those processes. Educational evaluation
has to do with applying educational criteria to a state of affairs so that
we can make some appraisal and assign some value to what we see occurring
or to its results. So, when we evaluate we make judgments about the value
of something on the basis of some criteria. The criteria that I employ to
evaluate wine are not the criteria that I employ to evaluate classroom
practice or its consequences. I use the criteria out of the wine industry
or my experience as a wine connoisseur (which I am not). When you
make an educational judgment, educational value judgments, about the
quality of your schools and the quality of your teachers and what they are
doing, etc., you are applying educational criteria. And with respect to
educational criteria, there is a wide array of differences as to what
constitutes virtue in education. The criteria issue is itself a debatable,
discussable issue, and it has been discussed for over 2,000 years (and I
don't expect it to cease this afternoon).

Testing is not evaluation; it is simply one way of getting
information. It is very often a way of getting information that you could
get in other ways if you waited for it. The use of testing is a way of
constructing a situation, creating a device, typically, that elicits a
response which can be measured. Further, we can engage in educational
evaluation (and we certainly do engage in evaluation in the course of our
lives) without using measurement and without using tests. For example, you
folks are evaluating what I am saying to you. You are making judgments
about its clarity, about its cogency, and about its relevance, and there is
nobody in this room who is giving me a test! That is, I am engaged in a
performance. I am providing information and you are making an appraisal of
it. And if people start dozing off, I will get some feedback. If people
start walking away, I will make some judgments about my performance and
I'll start to do something else.

The first thing that you ought to recognize, if you do confound
testing, measurement, and evaluation, is that these are three independent
processes: we can evaluate without giving tests; and we can test without
measuring; and we can measure without evaluating; and we can evaluate
without measuring.

What about testing in evaluation? Whether a test or a measurement is
an appropriate vehicle for securing information about which you can make
value judgments educationally is partly a technical problem. But there is
no question in my mind that the use of tests (which often are confounded
with educational evaluation and which people see as the only legitimate way
to evaluate educational practice and its effects) does in fact have an
effect on the educational priorities and the educational climate of
schools.

Consider, for example, the headline in a relatively subdued,
relatively conservative newspaper: "Seniors' Scores Drop In Statewide
Testing!" Let me read you three paragraphs: "California high school
seniors dropped again this year on the average in a statewide assessment
test, but educators on the Peninsula believe that their students improved
on last year's scores. While the scores of the individual high school
districts on the California Assessment Program will not be released until
May 11th, statewide results were reported this week to the State Board of
Education. All seniors in California high schools were required to take
the 30 minute tests. They scored 62.2% correct in the reading category, a
decrease of 0.9% from the previous year; 62.6% in writing, a decline of
0.4%; 69.4% in spelling, a drop of 0.1%; and 67.4% in mathematics, a
decrease of 0.3%." Now, people tend to read headlines. Those headlines
begin to set up expectations. And interpretative information,
particularly for test information, is not provided.

As another example, consider this array of North County Elementary
Schools by district, showing grade 3 and 6 academic achievement test scores
in a three year comparison. Teachers and parents look at these indices and
they make judgments about the quality of education on the basis of
information that is very often rank ordered, out of context, without
interpretative information. That kind of information gradually begins to
affect what school teachers teach and what administrators are urged to pay
attention to. And that kind of information has consequences for the kinds
of reforms that are being implemented in schools - reforms that are by and
large described on the basis of achievement tests that are often developed
outside of the school context and which may or may not have much curricular
validity.

We have been doing a study of high schools during the past two years
in the Bay area at Stanford, and there are some manifestations that we see
when we look at classrooms. We are not looking at classrooms by going into
them for a 45 minute visit with an observation schedule; we are trailing
kids in schools, we are shadowing them for a two-week period. The research
assistants in this project go to school with the youngster and they stay
with that youngster for one full week, one week off and one week on. So,
they shadow youngsters from Monday, 8:00 o'clock in the morning, through
the entire day. Very often they stay with them after school in order to get
a sense of the quality of teaching, a sense of what's going on in
classrooms and a sense of what kind of expectations are provided, etc. We
do the same thing with teachers. Our research assistants go to school with
teachers and they will spend a full week in their classrooms. I dare to say
there is nobody in this room who is a school administrator who has spent
one full week in a teacher's class. In one of the districts we are
associated with, four teachers in a high school have been released by the
superintendent to trail or shadow students in their own school. So, for the
first time, after teaching in the school for 20 years, teachers are having
access to their colleagues' classrooms; and for the first time they are
getting a vision of the nature of that environment, the common place that
school is for the kids they are working with. And this has proven to be an
extremely illuminating experience because it allows us to get a fresh
perspective.

One of the things that we see is a good deal of curriculum fragmentation.
When a multiple choice or short answer test is to be used, it influences
the ways in which the students prepare and the kind of information that
teachers give to the students and the ways in which teaching takes place.
We are seeing teachers who torpedo their own lessons. They do a very nice
job of teaching in the course of the period; but near the end they remind
students what is going to be on the test, giving them the implicit message
that the rest of what they were paying attention to is not really
important. That is of grave concern. If you have a vision of education
that includes a great deal more than what tests assess (and it certainly is
a vision that I have), then we need to recognize the influences testing has
on instructional practice - for example: reducing the curriculum to small
units of instruction; developing accounting procedures to record student
assignments; maintaining records that objectify scores at the end of the
semester, thereby depersonalizing education and "permitting" the teacher
not to be responsible for making a personal judgment (or a professional
judgment) on the work that a youngster has engaged in.

We see a great emphasis on the use of extrinsic rewards for the work
that has been produced, that is, communication to youngsters that what
really counts is getting a positive payoff on the basis of performance. We
all want positive payoff; the question is what kinds of "payoffs"? Are we
doing the sorts of things in schools, for example, that will enable
youngsters to internalize what they are studying so that what they study in
school becomes a part of their cognitive and affective repertoire, enabling
them to use the ideas and the skills that they get in the context of
classrooms and in situations that extend well beyond the classroom?

What I think is extremely important in terms of educational evaluation
(and that has the potential to improve the quality of schooling) is the
examination of classrooms as they operate in the context of schooling.
Consider curriculum as an intention, something that you organize as a body
of content - a set of activities in an order, for example, in which they
are to flow. If you think about the curriculum, in other words, as plans
for action, as a body of material, then that body of material can be
evaluated in its own right. One can pay very close attention to the
educational substance of what is being intended in the classroom. You can
look at a science curriculum, you can look at an art curriculum, you can
look at a history curriculum, and you can make (if you have the ability to
do this) substantive judgments about the power of those ideas, about their
importance within that discipline, and about whether these are the
significant notions that kids ought to be exposed to. How many youngsters
in your high school districts would be able to provide a decent explanation
for the notion of random mutation and natural selection? Could they take
that idea and apply it to the social world as well as the biological
world? Do they see the relevance of this notion in terms of their
understanding of biology? Is that a part of what they encounter in the
courses that they take? It may very well be that they do. My point here
is that the plans that we make for teaching, the curriculum that we design,
the concepts, the generalizations, the sorts of activities that youngsters
are going to engage in at school, can themselves be an object of
evaluation. And if that program has insulated teacher from teacher, that
has created conditions in which it is very difficult for the people who
teach to learn about what they are up to as teachers. Most of you folks
here have gotten out of teaching to become school administrators, perhaps
because of the discretionary space that became available to you as a school
administrator but that you were denied when you were a teacher. A teacher
goes to school at eight in the morning and she or he is with those
youngsters until the end of the day.

We have created a structure which makes it very difficult for teachers
to understand and to get feedback on how they do their business. Consider
the following thought experiment: If you were to conjure up a system that
would increase the probability that there would be no growth in teaching
over the course of a career, what features would you generate in your mind
to increase that likelihood? What would you do? Well, one of the things
that you might do is to create no incentives for being excellent in
teaching. You might make sure that teachers got virtually no useful
feedback about what they are doing. You might create infrequent
in-service education programs, removed from the school and taught by people
who haven't crossed the threshold of the school themselves for a decade.
Then you might think you will do your duty to inspire teachers in your
district by inviting John Goodlad or Elliot Eisner or somebody like that
to give heartfelt speeches to jack them up in September so that they can
carry themselves through June. In other words, I am suggesting to you a
hypothesis. The hypothesis is that after teachers acquire the skills
necessary to maintain the classroom and cope with the predictable crises
that emerge in classrooms, after two or three years in the classroom,
growth in teaching is relatively flat. We have not provided the conditions
in our schools to enable people to do better at their jobs. Yet we seem to
pursue the idea that somehow we can humiliate practitioners into excellence
by the publication of the performance of their students. This seems to me a
wrong-headed way to go about the improvement of California education.

So what is needed? We need to face up to the fact that we need to
restructure opportunities in schools for teachers and administrators to
learn what it is that they are doing in schools, in their classrooms. I
think we need to conceive of the evaluator's role as an educational role.
That is, educational evaluation can inform teachers about what is subtle
but significant in classrooms. To accomplish this we first have to achieve
a set of conditions in schools that will de-isolate teachers from each
other so that they have access to each other. Secondly, we need to
establish a climate of trust in schools where people are willing to make
themselves vulnerable to the observations of their colleagues. It means
that we need to prepare school administrators and teachers in a way that
will enable them to become connoisseurs of educational practice, because
the presence of an individual in a classroom is no guarantee that they in
fact will see what is important in that classroom. And the development of
our ability to perceive the subtle but significant events that take place
in school is a necessary condition to being able to provide feedback to the
people who work in classrooms, so that their own activities as teachers can
change. We need, I think, to develop a language of description that is not
limited to quantitative information. I think there are wonderful uses of
quantitative information for some sorts of things, but not for everything.

Think about the wide range of forms through which we represent the
world. We represent the world discursively, we represent the world
poetically, we represent the world figuratively, we represent the world
quantitatively, we represent the world visually, and we represent the world
kinesthetically. Our culture and our cognition are much wider than the
vehicles that we use in schools to represent what we see. We first have to
see, we have to perceive, we have to penetrate what is going on in
classrooms. But we need the leeway and the space to inform people who
operate in schools (and who formulate educational policy) as to where the
problems are and where the achievements are.

Many of the things that we are seeing in the schools are extraordinary
in terms of their achievement; we are seeing some marvelous teaching. We
are not, however, seeing as much of it as we would like. What strikes me
in looking at schools (and in reading the case studies that our research
assistants are producing) is the extent to which worse than mediocre
teaching can continue year after year. I wonder frankly what the
administrators are doing about this, and I wonder who is paying the price
for this mediocrity, and I wonder why it is allowed to continue. I have no
conviction or belief that the publication of test scores will improve the
situation, because what these lacking teachers need is more subtle, it
is much more supportive, and it is much more complex.

What is needed is a conception of inservice education that does not
send teachers to service stations two times a year, but which builds in the
concept of inservice education as an ongoing part of what it means to be a
professional teacher. How can we construct schools so that the inservice
part, the learning part, is part of what it means to be there? Can we
create places for teachers so that you would be happy to say to your son
or your daughter, "Yes, be a teacher, it's a fine thing to do, it will not
thwart your growth, you can use every capacity that you have, the top is
unlimited"? Can we create places like that so that we don't have
reservations about it?

I got out of it. I taught in a school, and I looked at my colleagues
after two years of teaching in a high school of 3600 students, and I said
to myself, "I don't want to be in their place 25 years from now." So, I
found a place where I had more space. And so did most of you. We are not
going to improve the educational lives of youngsters until we are able to
provide more professional space for teachers. I don't think schools are
going to be any better for kids than they are for the people who teach
them. And the problem is to construct this kind of professional
environment. The problem is to design that structure and to communicate to
people who have simple ideas about the improvement of education that those
well-intentioned plans may in fact exacerbate the problem rather than
ameliorate it.

Unfortunately, however, we are voiceless. Both professors are
voiceless and school administrators are voiceless. Professors have a
lesser right to be voiceless, but we are. We tend to be preoccupied with
technical matters. And you are utterly vulnerable. When I talk about
educational evaluation in schools I don't mean having a resident
educational critic who goes around to classrooms and writes educational
criticism. The model in my mind is to create school environments in which
teachers can have access to each other and to supportive and informative
colleagues. How can you do that? What kind of substitute help can you
provide to relieve teachers of some of their responsibilities so they can
have access to each other? What kind of climate of deliberation can you
create so that people understand how it is that they are teaching? Look, I
have been teaching since 1956. And I have been teaching at Stanford since
1965. You know, in the 19 years (or whatever it is) that I have been at
Stanford, not ever has there been a colleague of mine that has come into my
classroom to watch me teach! Not ever has a peer told me what I'm doing
and what I'm not doing. I mentioned this to an audience once and one of
the people in the audience said, "Well, Professor Eisner, why didn't you
ask?" Why didn't I ask? ... I didn't think of it. And the fact that I
didn't think of it says a lot about our own professional socialization. It
is not a part of what we do. You see, dancers have mirrors in the rooms in
which they practice. Why do they have mirrors? Because they get
information about how they move. Where are the mirrors in our classrooms?
The reflections in the students' eyes are not good enough. And what we wind
up with is trying to figure out (on the way home or on the way to work) how
it went and why it didn't go as well as it might have, or if it went well,
why it went so well. And we never know whether what we think is what in
fact took place. We haven't created a structure for it.

So, what am I saying to you? I am saying to you that I think we have
grossly underestimated what it is going to take to improve California
education. We cannot bully the schools into quality education. We need to
give people a stake in what it is that they teach. The good school will
expand individual differences rather than diminish them. And we need to
have programs which are diverse and which use multiple criteria even if
that makes for situations which are incommensurate. We ought not to allow a
technology of testing to provide ceilings on our aspirations and our
intuitions and our insights. And we need to create a climate of
education, a structure of schooling in which the growth of the teacher is
possible. Because it is through the teacher's growth, through the teacher's
growing capacities to appreciate what he or she is doing, that the
opportunities for educational experience are going to be defined for the
young. Unless and until we face up to that task, we are going to be
reeling from one mandate to another, making accommodations that deal with
the superficial. And ten years down the pike our successors will be doing
the same thing unless we face up seriously to what is needed in schools.

Teachers need to have a stake in their own operations and their own
professional commitment. They need the time, they need the resources, they
need twelve months' salary to plan, to deliberate; then they need an
afternoon in which they can think with others about what's happening. They
need access to each other. We very badly need to find ways to convey to
the public what it is we are achieving and what it is that we are not
achieving, that we would like to achieve. I hope that a group like this
could be the start of something that might be called "California Coalition
for Quality Education" that would find the voice that I think is now absent
in California education. I think we need in this state a group that can
appropriate mandates for improvement of educational practice. I think we
need to create a vehicle that will in some way restore to our profession
some modicum of authority and control within the districts for which we
have responsibility. That's going to be hard when 80% or more of the
funding is coming from someplace else. But I think that is what is needed.

Some may view my ideas as impractical, but it strikes me that the
greatest impracticality is to embrace procedures which in the long haul
won't work even if they are "superficially practical." I would rather
reach for something that I believe in than settle for something that I
don't believe in, in order to accommodate the expectations of others. You
have a very difficult task ahead of you. I can only wish you well in your
effort and say to you that as far as I'm concerned, I am prepared to
provide whatever assistance, whatever voice I can, in what we all know is
perhaps one of the most important enterprises in the state. Thank you very
much.

The Influence of Testing on Teaching and Learning
Norman Frederiksen

Speech given at a conference sponsored by the
Laboratory in School and Community Education and the
Center for the Study of Evaluation of the Graduate
School of Education, University of California at Los
Angeles, June 7, 1984.

In the first part of my talk, I'm going to argue that most current
standardized achievement tests have serious limitations with regard to the
skills and abilities they measure, and that these limitations may similarly
limit what is taught and what is learned in school. I believe these
effects are becoming more serious because of the growing use of
standardized tests in school assessment, particularly the use of
state-mandated minimum competency tests that are intended to set higher
standards for promotion or graduation. I shall review some of the evidence
to show specifically the what, how, and why of these effects.

In the second part, I plan to describe some testing methods that do
not use the multiple-choice format and that might get at abilities that are
not adequately assessed by most standardized tests. I shall also describe
how tests that allow the student to write his/her own answers might be
scored more accurately and economically than the usual essay test and how
such measures might be used in schools to facilitate and improve the
instructional process by encouraging the generalization of skills to new
contexts and situations.

There is little question that tests do influence what is taught and
what is learned. The mere expectation that a test will be given tends to
increase efforts to learn. Furthermore, the student's preparation for a
test will be guided by his or her expectations as to what will be required
by the test. That is the reason students often ask "What will the exams be
like?" Students adopt different study methods for different test formats;
if a multiple-choice test is expected, they will try to learn factual
material, and if an essay test is expected, they will be more inclined to
look for broader concepts and their relationships. Such differences in
study methods are educationally important, and the net effect may be
substantial, in view of the huge number of multiple-choice tests that
students are required to take nowadays.

The number of multiple-choice tests given to school children each year
has grown enormously over the past 25 years or so. Almost all the 50
states now have testing programs of one kind or another, and they typically
involve multiple-choice tests. The number of published tests, such as the
Iowa, California, and Stanford achievement tests, that are administered
each year is estimated to be about 30 million. Furthermore, no one knows
how many locally constructed multiple-choice tests are given as weekly
quizzes and midterm and final exams each year.

The trend toward using tests to hold schools accountable has increased
the influence of tests still more. In a school accountability feedback
loop, information about a school is communicated to the school's
constituencies: parents, potential employers, and even legislative bodies.
Feedback to the school takes a variety of forms; parents can complain to
the principal, employers can write letters to the editor, and the governing
body can enact legislation. The loop is completed when the school
administrators respond to the feedback by altering the curriculum,
retraining or reassigning teachers, or asking for more money. A good many
state legislatures have in fact enacted laws mandating the use of minimum
competency tests in order to set higher standards of achievement in school.
As a result of such pressures, scores on minimum competency tests have
been on the rise. In my state of New Jersey, scores on the Minimum Basic
Skills tests have increased slowly but steadily over the past few years.
It is easy to see why, if you study the legislation. The program in New
Jersey requires that rosters of test scores be released to all school
districts, buildings, and classes, and that individual score reports be
issued to students and their parents. General reports in the press are
mandated. A list of the skills measured by each test is sent to teachers,
and they are encouraged to use this information in their teaching. Old
forms of the tests are made available for "appropriate instructional
purposes," which might turn out to be coaching. Schools failing to meet
standards are subjected to review, and recommendations for remediation are
prepared. If accountability feedback loops are not working in New Jersey,
it is not the fault of the legislature.

Any improvements in the basic skills that result are, of course, much 
to be desired. My concern, however, is that the increased effort to teach 
the minimum competency skills decreases efforts to teach important 
abilities that tend not to be measured with multiple-choice tests. 

A recent report of the National Assessment of Educational Progress 
(NAEP) suggests that there is indeed such an effect. NAEP's 1982 report 
showed that over the most recent decade performance on test items that 
measure the basic skills had not declined, but there had been a gradual 
decline in performance on items that measure more complex cognitive 
skills. For example, in mathematics it was found that 90% of 17-year-olds 
could handle simple addition and subtraction; but for items that required 
problem solving the decline was from 33% to 29%. Similar results were 
found for science, reading, and writing. In the case of writing, 75% of 
the 17-year-olds could write sentences with few mechanical errors, but for 
writing tasks that required analytic and logical skills, the percentage of 
writing samples judged to be "competent" dropped from 21% to 15% over the 
10-year period. 

Please understand that I am not trying to discourage the use of tests 
to influence instruction. On the contrary, I am all in favor of using 
tests to motivate and guide learners and their teachers, and even to 
provide practice. But we should be using tests that measure not only the 
basic skills but also the ability to process information rapidly and 
accurately, to apply principles in new situations, and to solve problems in 
forms they have not encountered before. Use of such tests, I believe, 
would help to improve instruction broadly, not just in the very basic 
skills that are easy to measure with multiple-choice tests. 

Anyone who has prepared a multiple-choice test for a class must 
realize that it is indeed much easier to write items based on factual 
information involving names, dates, definitions, and formulas, than items 
requiring more complex cognitive operations. However, there have been few 
careful studies of the influence of test format on the behavior of the item 
writer. I can cite two. 

One such study involved one of the Graduate Record Examination Board 
tests, the Advanced Psychology Test, which is a multiple-choice test given 
to college seniors who are applying for admission to graduate school. 
Members of a panel of 5 psychologists were asked to make a judgment about 
the kind of ability predominantly involved in responding to each item. 
Definitions of four abilities were provided: memory, comprehension, 
analytic thinking, and evaluation. Memory was defined as "simple 
reproduction of facts, formulas, or other items of remembered content." 
The consensus of the judges was that a large majority, 70%, of the items 
were in the memory category, while 15% measured comprehension, 12% required 
analytic thinking, and only 3% involved evaluation. And this was a 
professionally made test that is widely used in admitting students to 
graduate schools. 

Another study was based on a multiple-choice test intended to measure 
competence in orthopedic medicine. Judges were trained to sort the items 
into categories similar to those used in the GRE study. It was found that 
more than half of the items were unanimously judged to require only recall 
of information, while fewer than 25% were believed by even one judge to 
require interpretation of data, application, or understanding of a 
principle. An effort was then made to improve the next test by training 
the item writers to write items requiring the more complex cognitive 
processes. It was found that 50% of the items in the new test were still 
judged to require only recall of information. 

These studies suggest that the difficulty of writing multiple-choice 
items that measure skills other than remembering may be a major reason for 
the tendency of multiple-choice tests to emphasize mastery of factual 
material. 

Another line of research is concerned with how and to what extent 
taking a test influences student performance. Numerous studies have 
demonstrated that the expectation of a test increases test scores, and that 
taking a test tends to increase retention of the material tested. These 
effects are quite specific to what was tested, however; there is little 
generalization. There is some evidence that free-response formats, such as 
short answer or completion tests, are somewhat more likely to improve 
retention. But such differences are not dramatic. 

Other researchers have used the technique of inserting test-like 
questions in assigned readings. These studies confirm the finding that 
answering questions improves subsequent performance. But only factual 
items or questions were used in these studies. 

Other studies of the effects of interpolated questions in text are 
more interesting because the effects of different kinds of questions were 
compared. One kind of question required verbatim recall of material in the 
text, and a second kind required more complex mental operations such as 
applying a principle in solving a problem. It was found that when the 
questions required the students to apply principles and to combine concepts 
and rules in solving problems, subsequent performance improved substantially 
and generalized to new situations; and performance on verbatim questions 
did not decline. 

A third line of research involves comparing tests presented in 
free-response and multiple-choice form with regard to the kinds of 
abilities they require. In such studies researchers have typically begun 
by choosing multiple-choice tests and then constructing parallel 
free-response tests by removing the multiple-choice options and replacing 
them with blanks in which students can write their answers. Then both 
types of test are given to samples of students, and various kinds of 
statistical analyses are made to find out if format makes a difference in 
what the tests measure. Several such studies have shown that format makes 
little difference. 

Such research may be criticized on the grounds that the comparisons 
involved only items that already existed in multiple-choice form. Parallel 
studies are needed where we begin with free-response tests intended to 
measure higher level cognitive abilities and construct parallel tests in 
multiple-choice form. Such comparisons have been made by several of us at 
ETS. 

We began with a test we call Formulating Hypotheses. The problems 
were of a kind frequently faced by scientists. Each problem consisted of a 
brief description of a research study, a graph or table showing the 
results, and a statement of the major finding of the study. For example, 
in one problem a table showed that habitual users of marijuana improved in 
their visual-motor coordination after smoking a marijuana cigarette, while 
nonusers showed poorer performance. The task was to write hypotheses, or 
possible explanations, of the finding. Multiple-choice forms of such 
problems were constructed by providing a list of hypotheses from which the 
student could choose those he/she considered important. Scores were 
obtained that reflected the quality, number, and unusualness of the 
hypotheses that were written, or those that were chosen from a list. 

It was found that correlations between corresponding scores for the 
two formats were very low. For example, for scores reflecting quality of 
the ideas the correlation between formats was .18, and for number of ideas, 
the correlation was .19. It appears that the two formats do not measure 
the same abilities. 
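
To make the comparison concrete: a correlation between formats of this 
kind is an ordinary product-moment coefficient computed over students who 
took both forms. The sketch below is illustrative only. The function name 
and the score lists are invented for the example; they are not data from 
the study, and the study's own computations may have differed in detail. 

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return sxy / sqrt(sxx * syy)

# Hypothetical quality scores for the same six students on each format.
free_response = [3.0, 5.0, 2.0, 4.0, 1.0, 6.0]
multiple_choice = [4.0, 3.0, 5.0, 4.0, 3.0, 5.0]
print(round(pearson_r(free_response, multiple_choice), 2))
```

A coefficient near zero, like the .18 and .19 reported above, means that 
a student's standing on one format says little about his or her standing 
on the other. 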

In order to find out more specifically what abilities were involved in 
the two formats, the relationships of the scores to measures of several 
known abilities were investigated. These abilities included reasoning, 
verbal ability, knowledge of the area relevant to the problem, and 
ideational fluency, which may be interpreted as skill in searching for and 
retrieving relevant information stored in memory. The most striking 
difference involved ideational fluency; none of the scores from 
multiple-choice versions was related to fluency, while for the 
free-response form scores reflecting number of ideas, number of unusual 
ideas, and number of ideas that are both unusual and of high quality were 
substantially related to fluency. Only the free-response form required a 
broad search of long-term memory for relevant ideas. 

A similar study was carried out with a more elaborate problem-solving 
test that requires the student to go through a number of steps in seeking a 
solution to a problem, beginning with formulating hypotheses. Then the 
student is asked to indicate what information he or she needs in order to 
test the hypotheses or to suggest new ones. Then new information is 
provided, and the student revises his list of hypotheses. The student goes 
through half-a-dozen such cycles until he or she finally decides on a 
solution. Again, it was found that for problems posed in free-response 
form, the ability most involved is ideational fluency, with reasoning 
involved particularly at steps where inference is required; for the 
multiple-choice format these relationships were all substantially lower. 
Thus, it again appears that the multiple-choice format does not require the 
same skills as the free-response form. 

The research I have briefly reviewed I think supports three 
conclusions about how test format influences behavior: 

First, test format influences the kinds of items that a test maker 
writes. Because it is much easier to write multiple-choice items that 
measure factual knowledge, the item writer tends not to write items 
that measure skills in analysis, problem solving, application of 
principles, and the like, even when they try hard. 




Second, tests do influence student performance. If the free-response 
tests are adaptations of multiple-choice tests, format makes only a 
small difference. But evidence from studies of the influence of 
questions interpolated in text indicates that questions that require 
complex cognitive processing, in contrast with factual questions, do 
improve performance on subsequent tests, and there is transfer to 
other kinds of problem-solving tasks. Similar results might be 
expected for items incorporated in tests. 

Third, research on the influence of format on what abilities the test 
measures indicates that format makes little difference if one compares 
multiple-choice tests with their free-response counterparts. But if 
one begins with free-response tests that require complex cognitive 
processing and compares them with similar tests cast in 
multiple-choice form, format strongly influences what is measured. In 
particular, ideational fluency is important only if the student has to 
compose his/her own answers rather than choose them from a list. 

Now let me turn to the second part of my talk by considering some 
alternatives to multiple-choice tests. I shall comment first on essay 
tests, and variants of essay tests that can be scored more objectively. 
Then I shall consider a variety of testing procedures that have little 
resemblance to conventional tests. 

We are so accustomed to multiple-choice, true-false, and completion 
tests that we seldom consider other possible formats. The usual 
alternative is an essay test. But teachers don't like essay tests because 
grading is onerous and time consuming, and test publishers don't like them 
because they can't be scored with a machine. Another problem is low 
reliability of grading. In one study, 300 essays written by college 
freshmen were graded independently by 53 experts, including English 
teachers, editors, writers, lawyers, and scientists, using a 9-point 
scale. It was found that no essay was given fewer than 5 of the 9 possible 
ratings, and 34% of the essays were given all 9 of the ratings. Essay 
grades may depend more on who reads the essay than on who wrote it. 

One way of achieving higher reliability is to use several readers 
instead of one and to pool their judgments. Since this is a pretty 
expensive way to grade essays, a method called "wholistic" scoring has come 
to be used. In this procedure the essay is graded quickly and 
impressionistically by two or more readers. This brings down the cost, but 
it is certainly not possible to state very precisely what the grade means. 

No method of scoring that involves people rather than machines can 
compete with the multiple-choice test. But there are methods of evaluating 
written protocols that may turn out to be faster, less expensive, and more 
reliable than the usual method of grading essays, and such methods can 
provide not one but a number of scores that have very precise meanings. 
Such methods would not work very well for such essay topics as "How I spent 
my summer vacation," but they would probably work for assignments that are 
well structured in the sense that all the students are attempting to 
accomplish the same task by more or less similar procedures. 

The Formulating Hypotheses test that I described earlier is an 
example. I mentioned the names of some of the scores, but I did not 
describe the scoring procedure. 

We call the method category scoring. Several preliminary steps are 
required. The first step is to develop a classification of the ideas 
produced by a sample of students in response to the problem. In the case 
of Formulating Hypotheses, these ideas are the hypotheses that the students 
thought might account for a research finding. Our procedure for 
classifying responses is to copy each hypothesis on a 3x5 card, then sort 
the cards into piles that contain identical or closely similar ideas. 
Initial agreement among sorters was generally quite good, and after 
discussion a consensus was reached on the number and nature of the 
categories. Then a definition was written for each category, trying to 
differentiate clearly each category from all the others. 

The next task is to ask a panel of judges to make an evaluation of the 
quality of each response category. Then a quality value is assigned to 
each category on the basis of the combined judgments. 

The scorer's task is then comparatively simple: to read each 
hypothesis and match it to one of the categories. Scorers do not have to 
be experts to do this. After a reasonable amount of training and practice, 
agreement between scorers is good. The category assignments are entered 
into the computer, along with the quality values and information about the 
frequency of occurrence of each category. A variety of scores can then be 
generated. We used scores that reflected the number of ideas written, the 
number of good ideas, the average quality of ideas, the number of unusual 
ideas, and the number of ideas that were "creative" in the sense that they 
were both unusual and of high quality. 
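
The computing step just described can be sketched in a few lines. This 
is only an illustration under stated assumptions: the function name, the 
category labels, and above all the two cutoffs (what counts as a "good" 
quality value and how rare a category must be to count as "unusual") are 
hypothetical, since the talk does not specify them. 

```python
def category_scores(assigned, quality, frequency,
                    good_cutoff=3.0, rare_cutoff=0.05):
    """Score one student's protocol from its category assignments.

    assigned  -- category labels the scorer matched to each hypothesis
    quality   -- category label -> quality value from the judges' panel
    frequency -- category label -> proportion of the sample using it
    Both cutoffs are illustrative assumptions, not values from the study.
    """
    ideas = list(assigned)
    good = [c for c in ideas if quality[c] >= good_cutoff]
    unusual = [c for c in ideas if frequency[c] <= rare_cutoff]
    # "Creative" ideas are those that are both unusual and of high quality.
    creative = [c for c in unusual if quality[c] >= good_cutoff]
    average = sum(quality[c] for c in ideas) / len(ideas) if ideas else 0.0
    return {
        "number_of_ideas": len(ideas),
        "number_of_good_ideas": len(good),
        "average_quality": average,
        "number_of_unusual_ideas": len(unusual),
        "number_of_creative_ideas": len(creative),
    }

# Hypothetical category tables and one student's three hypotheses.
quality = {"A": 4.0, "B": 2.0, "C": 5.0}
frequency = {"A": 0.40, "B": 0.30, "C": 0.02}
scores = category_scores(["A", "B", "C"], quality, frequency)
print(scores)
```

Because the scorer only supplies category labels, every score is derived 
mechanically from the two tables, which is what lets each score carry a 
precise meaning. 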

It is also possible to ask the panel to make other judgments about the 
ideas as a basis for additional scores. For example, a hypothesis might 
have been directly suggested by information in the problem statement, it 
might have resulted from inference based on such information or, if it was 
unrelated to any information given, it must have come from a search of 
long-term memory. Scores to represent the number of ideas from each source 
can easily be generated by the computer. 

There are many possible applications of category scoring. We have 
used it to score medical problem-solving tests, which are paper-pencil 
simulations of a doctor's encounter with a patient, as well as for other 
tests of scientific thinking called Solving Methodological Problems and 
Evaluating Proposals. Although the method works best when the problem 
constrains all the students to respond in ways that are roughly similar, 
the method might be applied even to essays, if the topic assigned is very 
clearly specified. A content analysis of a sample of essays on a given 
topic might reveal a common core of ideas, and relationships among ideas, 
that could be sorted into categories, which in turn could be evaluated and 
used as the basis for a scoring system. 

The test could be used for instructional purposes by having each 
student score his or her own protocol. When the problem is completed, the 
student could be given the category definitions and told to match his or 
her responses to the categories. Then feedback could be given in the form 
of the quality values, along with a critical statement of the good and poor 
features of each category. If a large enough number of such test problems 
is available, a substantial amount of practice could be given, and if the 
problem settings are realistic and varied, such practice should promote 
generalization and encourage learning by discovery. 

So much for scoring the protocols of free-response tests. Now let us 
consider some ideas for testing that grow out of theories of cognition. 
These ideas are quite different from conventional tests. You may even 
think some of them are wild ideas. 

One such idea has to do with measuring speed in performing cognitive 
tasks. The trend over the past 20 or 25 years has been toward power tests 
as opposed to speeded tests. The most important reason is probably the 
desire to be fair. The student with a low score who could have gotten all 
the items right if he/she had had more time may feel that he/she was 
cheated. Actually, we do not have to choose between speeded and power 
tests; we can give both. Let me explain why I think it is important to 
measure speed as well as power. 

The reason that speed is important has to do with certain attributes 
of memory. Cognitive psychologists distinguish several kinds of memory, 
but I will discuss only two, called long-term memory and short-term or 
working memory. 

Long-term memory is the limitless and relatively permanent repository 
of one's knowledge. It contains a huge amount of information, including 
knowledge of procedures as well as facts and their relationships. We are 
not aware of any of this information, however, until some part of it is 
transferred to working memory. Working memory contains the information we 
are aware of and are actively using at a given time. The term information 
processing refers to the flow of information into and out of working 
memory, by such processes as retrieving information from long-term memory; 
receiving sensory inputs; comparing, combining, and transforming items of 
information; and placing new or altered information back in long-term 
memory. 

An important feature of working memory is that it has very limited 
capacity; it can accommodate only six or seven items of information at one 
time. Any information above this limit crowds something out, as you know 
if you have ever taken a digit-span test. This small capacity imposes a 
serious limitation on one's ability to deal with complex problems. But 
since we are able to deal with complex problems, there must be ways to 
compensate for the limitation. This is where speed comes in. 

One method of compensating is called automatic processing. With a 
great deal of practice, it is possible to carry out mental activities 
automatically, without paying attention and without using up the limited 
capacity of working memory. An example is one's ability to drive a car 
along a familiar route while carrying on a conversation with a companion. 
An example from the school room is the ability of a skilled reader to 
decode the symbols that comprise a word automatically, without paying 
attention, and thus without interfering with his or her ability to deal 
with more complex aspects of reading. Similarly, the mathematician can 
carry out elementary algebraic operations automatically, without attention, 
while attending to his more remote goals in solving the problem. 

How can we measure the development of automatic processing skills? 
Cognitive psychologists assess automaticity by measuring latencies, or 
reaction times, in responding to simple tasks that are components of more 
complex skills. For example, a microcomputer might be used to present a 
list of words one at a time to a student, and to measure the latencies as 
he/she responds by saying each word as quickly as he/she can. Individual 
differences in latencies on such tasks may be substantial, even among 
students who make almost no errors in saying the words, and they 
discriminate between good and poor readers. 
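
As a rough illustration of how such latency records might be summarized, 
the sketch below reduces a list of word-naming trials to an error rate and 
a typical latency on correct responses. The record layout, the function 
name, the sample values, and the choice of the median as the summary are 
all assumptions made for the example; the talk does not prescribe them. 

```python
from statistics import median

def latency_summary(trials):
    """Summarize word-naming trials given as (latency_ms, correct) pairs."""
    if not trials:
        raise ValueError("no trials recorded")
    correct_latencies = [ms for ms, ok in trials if ok]
    return {
        "error_rate": 1 - len(correct_latencies) / len(trials),
        # The median keeps a few very slow responses from dominating.
        "median_latency_ms": (median(correct_latencies)
                              if correct_latencies else None),
    }

# Hypothetical record for one student: five words, one misread.
trials = [(520, True), (610, True), (480, True), (950, False), (700, True)]
print(latency_summary(trials))
```

Keeping the error rate alongside the latency matters here, since two 
students who both make almost no errors can still differ substantially in 
speed. 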

A simpler method of measurement that might be just as good, from an 
instructional point of view, is a paper-pencil test containing a long 
string of orthographic symbols, some of which are words and others 
nonwords. The task might be to mark as rapidly as possible the symbols 
that are words. The last item attempted before time is called would be the 
score. Similarly, tests might also be used to measure speed in carrying 
out other component tasks, such as filling in blanks to indicate the 
antecedents of pronouns used in sentences. 

Automatic processing, then, is one way to compensate for the limited 
capacity of working memory. Another method, which is closely related, has 
to do with pattern recognition, which is the ability to perceive a pattern 
of related parts quickly and accurately. Like automaticity, 
pattern-recognition skills are acquired only through a great deal of 
practice. A chess grandmaster can look for a few seconds at a chessboard 
with the pieces in a midgame position and then reproduce on another board 
the positions of the 25 or 30 pieces almost without error. Ordinary 
players given the same task can place correctly only 5 or 6 pieces, a 
number which is more consistent with what we know about the capacity of 
working memory. What the grandmaster perceives is 5 or 6 chunks or 
clusters, each of which is a pattern of 5 or 6 related pieces. 

Similar results are found in other areas of expertise. Electronics 
experts can quickly identify the patterns in a circuit diagram that 
represent the elements corresponding to the power supply or a stage of 
amplification, and experienced physicians can recognize in a case workup 
the pattern of signs and symptoms that correspond to a diagnostic category. 

How can we measure pattern-recognition skills in schools? As in the 
case of reading, both speed and power tests are desirable. Power tests are 
especially important at early stages in acquiring a skill, to find out 
about the number and kind of patterns that can be recognized. Speeded 
tests are important at later stages when through practice recognition is 
becoming automatic. Methods analogous to those used in assessing 
performance on the components of reading could be used in other areas, such 
as recognizing geographical features from contour maps, identifying organic 
compounds from representations of carbon chains, or locating body lesions 
from X-ray photographs. 

Another testing idea suggested by theories of cognition has to do with 
how one represents a problem internally. Such a representation may take 
many forms. A word problem in mathematics, for example, may be seen by 
various students as a set of verbal statements, a chart or diagram, an 
equation or set of equations, or a procedural flow chart of some sort. An 
inadequate representation may make the solution of a problem difficult or 
impossible. How can we find out how a particular individual represents a 
problem? 

This is a difficult question to answer because problem solvers usually 
don't know how they represent a problem; therefore, it must be inferred. A 
research method that has been used by cognitive psychologists is to present 
students with a fairly large set of problems from some domain, such as 
physics, and to ask them to sort the problems into sets that are similar 
with respect to how they are solved. Striking differences between students 
and experts are found. Students tend to sort the problems on the basis of 
surface features, such as pulley arrangements or weights on inclined planes, 
while experts sort on the basis of the physical principles that are 
involved, such as Newton's third law. Tests based on such a procedure 
might reveal something about a student's stage of development in forming 
useful representations of problems. 
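
One hypothetical way to turn such a sorting task into a score is to ask, 
for every pair of problems, whether the student and an expert agree on 
putting the two problems in the same pile or in different piles. The 
proportion of pairs on which they agree is known as the Rand index. The 
sketch below is an assumption about how a test of this kind could be 
scored, not a procedure taken from the research described; the problem 
labels and pile names are invented. 

```python
from itertools import combinations

def sort_agreement(student_piles, expert_piles):
    """Proportion of problem pairs treated alike in both sorts (Rand index).

    Each argument maps a problem label to the pile it was placed in.
    """
    problems = sorted(student_piles)
    if set(problems) != set(expert_piles):
        raise ValueError("both sorts must cover the same problems")
    pairs = list(combinations(problems, 2))
    if not pairs:
        raise ValueError("need at least two problems")
    agree = 0
    for a, b in pairs:
        together_for_student = student_piles[a] == student_piles[b]
        together_for_expert = expert_piles[a] == expert_piles[b]
        if together_for_student == together_for_expert:
            agree += 1
    return agree / len(pairs)

# A student sorting by surface features and an expert sorting by principle.
student = {"p1": "pulleys", "p2": "pulleys",
           "p3": "inclines", "p4": "inclines"}
expert = {"p1": "third law", "p2": "energy",
          "p3": "third law", "p4": "energy"}
print(round(sort_agreement(student, expert), 2))
```

A value of 1.0 would mean the student grouped the problems exactly as the 
expert did; lower values indicate that the two sorts rest on different 
features of the problems. 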

Another important factor in problem solving is how information is 
stored in long-term memory. This is important because good organization of 
the stored information facilitates retrieval and enhances the likelihood of 
seeing interrelationships among the stored items of information. Making a 
test to determine how information is stored would appear to be impossible; 
but a beginning has been made. The method is to find out how key concepts 
in an area are interrelated by a given individual. In the area of 
mechanics, for example, there are a dozen or so important concepts, 
including mass, density, velocity, acceleration, force, and so on. One 
can present to the student all the possible pairs of these terms and ask 
him to make a judgment about the strength of the relationship between the 
members of each pair. A statistical method, such as multidimensional 
scaling, can then be used to produce a cognitive map showing the dimensions 
of the system and their interrelationships. Such a picture could be 
compared with the analogous structure based on the judgments of experts. 
The cognitive map presumably reflects the student's understanding of a 
large interacting system of concepts at a certain phase in his learning, 
and it could be compared with similar representations obtained at earlier 
and later stages. 
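
The first two steps of this procedure, listing every pair of concepts and 
comparing a student's ratings with an expert's, are easy to sketch. Note 
what the example does not do: it does not perform multidimensional 
scaling. It compares the raw pair ratings directly by a root-mean-square 
difference, which is a simpler stand-in chosen for illustration. The 
concept list is shortened, and the rating scale and values are 
hypothetical. 

```python
from itertools import combinations
from math import sqrt

# A shortened, illustrative list of mechanics concepts.
CONCEPTS = ["mass", "density", "velocity", "acceleration", "force"]

def all_pairs(concepts):
    """Every unordered pair of concepts to be rated."""
    return list(combinations(concepts, 2))

def rms_difference(student, expert):
    """Root-mean-square gap between two sets of pair ratings.

    Both arguments map a concept pair to a relatedness rating on the
    same (hypothetical) scale; 0.0 means identical judgments.
    """
    if not student or student.keys() != expert.keys():
        raise ValueError("both raters must judge the same non-empty set of pairs")
    squared = [(student[p] - expert[p]) ** 2 for p in student]
    return sqrt(sum(squared) / len(squared))

pairs = all_pairs(CONCEPTS)
print(len(pairs))  # 5 concepts give 5 * 4 / 2 = 10 pairs

# Hypothetical ratings: the expert rates every pair 3, the student 5.
expert_ratings = {p: 3 for p in pairs}
student_ratings = {p: 5 for p in pairs}
print(rms_difference(student_ratings, expert_ratings))
```

The number of pairs grows quickly with the number of concepts; a dozen 
concepts already require 66 judgments, which bears on how long such a test 
would take to administer. 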

By way of summary: I have described several possible testing methods 
that with further development might be used to replace or supplement 
multiple-choice tests. Two of the ideas are concerned with skills that 
help one to compensate for the limited capacity of working memory; they are 
the automatic-processing and pattern-recognition skills. I suggested that 
it would be relatively easy to measure automatic processing skill in a 
particular area of expertise, such as reading, by using speeded tests with 
relatively easy items. I believe it would be quite feasible, also, to 
measure skill in pattern recognition by similar methods, although we may 
need more investigation to identify the patterns that are salient for a 
particular area of instruction. I consider this kind of testing to be very 
important, because these are the skills that make it possible to attend to 
the more complex aspects of a problem or a situation without getting bogged 
down in the detail. 

Methods for measuring how problems are represented internally and how 
information is organized in long-term memory are also potentially 
important, but at the moment such measurement methods may be more important 
for the researcher than for the educator. 

I also described something I called category scoring, which may make 
it feasible to use tests that elicit fairly lengthy written responses. It 
has been demonstrated that the method works quite well for devices like the 
Formulating Hypotheses test, but we need to find out to what extent it can 
be adapted to other formats. I consider this to be a very important 
development if it encourages teachers to assign more tasks that require 
constructed responses. 

Another way of appraising the new test ideas has to do with the 
coachability of the tests. Some are coachable in the bad sense that 
coaching may improve the test score without improving the ability measured 
by the test. For example, students could be taught a "correct" cognitive 
map without altering the actual knowledge structure. But other tests may 
be coachable in the good sense that coaching for the test would also 
improve the ability measured by the test. I consider this a good feature 
of a test because the test can then be used as an instructional tool to 
provide the practice and feedback that are so necessary for learning. 

It has been argued by Walter Doyle that tasks are the basic treatment 
units of a school, and greater emphasis should be given to task assignments 
such as writing papers, solving homework assignments, and taking tests. If 
the tasks are properly designed, they could help students to acquire not 
only the knowledge base but also the information-processing skills that are 
necessary for developing high levels of proficiency in thinking. 

I suggest that the primary purpose of tests, tasks, scorable 
exercises, or whatever you want to call them, should be to provide practice 
with feedback to students and diagnostic information for teachers. Taking 
such tests or exercises should be daily occurrences rather than something 
that happens once or twice a term for the purpose of assigning grades. 
Properly designed materials would help students not only to acquire 
competency in basic skills, but also to acquire high levels of proficiency 
in pattern recognition, automatic processing, and other 
information-processing skills that make it possible for students to advance 
to higher levels of accomplishment. And if the tasks assigned involve a 
wide variety of realistic contexts and situations, proficiency may 
generalize to the difficult real-life problems that will arise in the 
future. 

All this may strike you as fine; but who is going to pay for it? It 
is certainly true that the tests I described cannot be scored at the rate 
of 10,000 answer sheets an hour. But I have a few suggestions that might 
help in terms of costs. One is that some of the tasks could be programmed 
for microcomputers, so that the computer could give the test, score it, and 
even provide comments and suggestions to the student. Another idea is that 
students might score their own tests, for prompt feedback. Another is that 
the material that is most costly to prepare could be provided by 
consortiums of school people and testing organizations for use on a wide 
scale. Finally, if we consider the administration and scoring of tests as 
instruction rather than assessment, the cost may not seem exorbitant. And 
the usual testing for grading and assessment purposes can be dropped 
because better information will be available as a by-product of 
instruction. 



